Abstract


In most kernel based online learning algorithms, when an incoming instance is misclassified, it 
will be added into the pool of support vectors and assigned with a weight, which often remains 
unchanged during the rest of the learning process. This is clearly insufficient since when a new 
support vector is added, we generally expect the weights of the other existing support vectors to 
be updated in order to reflect the influence of the added support vector. In this paper, we propose 
a new online learning method, termed Double Updating Online Learning, or DUOL for short, 
that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified 
example received at the current trial, the proposed online learning algorithm also tries to update the 
weight for one of the existing support vectors. We show that the mistake bound can be improved 
by the proposed online learning method. We conduct an extensive set of empirical evaluations for 
both binary and multi-class online learning tasks. The experimental results show that the proposed 
technique is considerably more effective than the state-of-the-art online learning algorithms. The
source code is publicly available at http://www.cais.ntu.edu.sg/~chhoi/DUOL/.
Keywords: online learning, kernel method, support vector machines, maximum margin learning, 
classification 


1. Introduction 


Online learning has been studied extensively in the machine learning community (Rosenblatt, 1958; 
Freund and Schapire, 1999; Kivinen et al., 2001; Crammer et al., 2006; Cesa-Bianchi and Lugosi, 
2006). In general, for a misclassified example, most of the kernel based online learning algorithms 
will simply assign to it a fixed weight that remains unchanged during the whole learning process. 
Although such an approach is advantageous in computational efficiency, it has significant limita- 
tions. This is because when a new example is added to the pool of support vectors, the weights 
assigned to the existing support vectors may no longer be optimal, and should be updated to reflect 
the influence of the new support vector. We emphasize that although several online algorithms have
been proposed to update the example weights as the learning process proceeds, most of them are not
designed to improve the classification accuracy. For instance, in Orabona et al. (2008), Crammer
et al. (2003), and Dekel et al. (2008), online learning algorithms are proposed to adjust the example
weights in order to satisfy a constraint on the number of support vectors; in Kivinen et al. (2001),
example weights are adjusted to deal with drifting concepts.


Motivated by the above observations, we propose a new strategy for online learning that explic- 
itly addresses this problem. It is designed to dynamically tune the weights of support vectors in 
order to improve the classification performance. In some trials of online learning, besides assign- 
ing a weight to the misclassified example, the proposed online learning algorithm also updates the 
weight for one of the existing support vectors, referred to as an auxiliary example. We refer to the
proposed approach as Double Updating Online Learning (Zhao et al., 2009), or DUOL for short. 


The key challenge in the proposed online learning approach is to decide which existing support 
vector should be selected for updating weight. An intuitive choice is to select the existing support 
vector that "conflicts" with the new misclassified example, that is the existing support vector which 
on the one hand shares similar input pattern as the new example and on the other hand belongs 
to a class different from that of the new example. In order to quantitatively analyze the impact of 
updating the weight for such an existing support vector, we employ an analysis that is based on the 
work of online convex programming by incremental dual ascent (Shalev-Shwartz and Singer, 2006, 
2007). Our analysis shows that under certain conditions, the proposed online learning algorithm 
can significantly reduce the mistake bound of the existing online algorithms. Besides binary classi- 
fication, we extend the double updating online learning algorithm to multi-class learning. Extensive 
experiments show promising performance of the proposed online learning algorithm compared to 
the state-of-the-art algorithms for online learning. 


The rest of this paper is organized as follows. Section 2 reviews the related work for online 
learning. Section 3 presents the proposed “double updating” approach for online learning of binary
classification problems. Section 4 extends the double updating method to online multi-class learn- 
ing. Section 5 gives our experimental results. Section 6 discusses the possible directions to explore 
in the future. Section 7 concludes this work. 


2. Related Work 


Online learning has been extensively studied in machine learning (Rosenblatt, 1958; Crammer and 
Singer, 2003; Cesa-Bianchi et al., 2004; Crammer et al., 2006; Fink et al., 2006). One of the most 
well-known online approaches is the Perceptron algorithm (Rosenblatt, 1958; Freund and Schapire, 
1999), which updates the learning function by adding the misclassified example with a constant 
weight to the current set of support vectors. Recently a number of online learning algorithms have 
been developed based on the criterion of maximum margin (Crammer and Singer, 2003; Gentile, 
2001; Kivinen et al., 2001; Crammer et al., 2006; Li and Long, 1999). One example is the Relaxed 
Online Maximum Margin algorithm (ROMMA) (Li and Long, 1999), which repeatedly chooses the 
hyper-planes that correctly classify the existing training examples with a large margin. Another 
representative example is the Passive-Aggressive (PA) algorithm (Crammer et al., 2006). It updates 
the classification function when a new example is misclassified or its classification score does not 
exceed the predefined margin. Empirical studies showed that the maximum margin based online 
learning algorithms are generally more effective than the Perceptron algorithm. Despite these
differences, most online learning algorithms only update the weight of the newly added support vector, and
keep the weights of the existing support vectors unchanged. This constraint could significantly limit
the performance of online learning.

The proposed online learning algorithm is closely related to the recent work on online convex
programming by incremental dual ascent (Shalev-Shwartz and Singer, 2006, 2007). Although the
idea of simultaneously updating the weights of multiple support vectors was mentioned in
Shalev-Shwartz and Singer (2006, 2007), neither an efficient algorithm nor a theoretical result was given
explicitly in their work. Besides, our work is also related to budget online learning (Weston and Bordes,
2005; Crammer et al., 2003; Cavallanti et al., 2007; Dekel et al., 2008) and online learning for drift- 
ing concepts. Although these online learning algorithms are capable of dynamically adjusting the 
weights of support vectors, they are designed to either fit in the budget for the number of support 
vectors or to handle drifting concepts, but not to reduce the number of classification mistakes in 
online learning. 

Finally, several algorithms have been proposed for online training of SVM that update the weights
of more than one support vector simultaneously (Cauwenberghs and Poggio, 2000; Bordes et al.,
2005, 2007; Dredze et al., 2008; Crammer et al., 2008, 2009). In particular, in Bordes et al. (2005, 
2007), the authors proposed to update the weights of two support vectors simultaneously at each 
iteration, similar to the proposed algorithm. These algorithms differ from the proposed one in that
they are designed for efficiently learning an SVM classification model, not for online learning, and
therefore do not provide a mistake bound guarantee.


3. Double Updating Online Learning for Binary Classification 


In this section, we present the proposed double updating online learning method for solving online 
binary classification tasks. Below we start by introducing some preliminaries and notations. 


3.1 Preliminaries and Notations 


We consider the problem of online classification. Our goal is to learn a function $f : \mathbb{R}^d \to \mathbb{R}$ based on
a sequence of training examples $\{(x_1,y_1),\ldots,(x_T,y_T)\}$, where $x_t \in \mathbb{R}^d$ is a $d$-dimensional instance
and $y_t \in \mathcal{Y} = \{-1,+1\}$ is the class label assigned to $x_t$. We use $\mathrm{sign}(f(x))$ to predict the class
assignment for any $x$, and $|f(x)|$ to measure the classification confidence. Let $\ell(f(x),y) : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$
be the loss function that penalizes the deviation of estimates $f(x)$ from observed labels $y$. We refer
to the output $f$ of the learning algorithm as a hypothesis and denote the set of all possible hypotheses
by $\mathcal{H} = \{f \mid f : \mathbb{R}^d \to \mathbb{R}\}$.

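In this paper, we consider $\mathcal{H}$ a Reproducing Kernel Hilbert Space (RKHS) endowed with a
kernel function $\kappa(\cdot,\cdot) : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ (Vapnik, 1998) implementing the inner product $\langle\cdot,\cdot\rangle$ such that:
1) $\kappa$ has the reproducing property $\langle f, \kappa(x,\cdot)\rangle = f(x)$ for $x \in \mathbb{R}^d$; 2) $\mathcal{H}$ is the closure of the span of
all $\kappa(x,\cdot)$ with $x \in \mathbb{R}^d$, that is, $\kappa(x,\cdot) \in \mathcal{H}$ for every $x \in \mathbb{R}^d$. The inner product $\langle\cdot,\cdot\rangle$ induces a norm
on $f \in \mathcal{H}$ in the usual way: $\|f\|_{\mathcal{H}} := \langle f, f\rangle^{1/2}$. To make it clear, we use $\mathcal{H}_\kappa$ to denote an RKHS with
explicit dependence on kernel function $\kappa$. Throughout the analysis, we assume $\kappa(x,x) \le 1$ for any
$x \in \mathbb{R}^d$.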


3.2 Motivation 


We consider trial $t$ in an online learning task where the training example $(x_a, y_a)$ is misclassified (i.e.,
$y_a f(x_a) \le 0$). Let $D = \{(x_i, y_i), i = 1,\ldots,n\}$ be the collection of $n$ misclassified examples received
before trial $t$. We also refer to these misclassified training examples as "support vectors". We
denote by $\alpha = (\alpha_1,\ldots,\alpha_n) \in (0,C]^n$ the weights assigned to the support vectors in $D$, where $C$ is a
predefined constant. The resulting classifier, denoted by $f(x)$, is given by

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \kappa(x, x_i).$$


In the conventional approach for online learning, we simply assign a constant weight, denoted by
$\beta \in (0,C]$, to $(x_a, y_a)$, and the resulting classifier becomes

$$f'(x) = \beta y_a \kappa(x, x_a) + \sum_{i=1}^{n} \alpha_i y_i \kappa(x, x_i) = \beta y_a \kappa(x, x_a) + f(x).$$
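The conventional single update can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the tuple layout of the support set, and the particular Gaussian kernel parametrization $\exp(-\|x-z\|^2/(2\sigma^2))$ are assumptions made here.

```python
import math

def gaussian_kernel(x, z, sigma=8.0):
    """Gaussian kernel (assumed form); kappa(x, x) = 1, so kappa(x, x) <= 1 holds."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def predict_score(x, support, kernel=gaussian_kernel):
    """f(x) = sum_i alpha_i * y_i * kappa(x, x_i); support holds (x_i, y_i, alpha_i)."""
    return sum(alpha * y * kernel(x, x_i) for x_i, y, alpha in support)

def single_update(support, x_a, y_a, beta):
    """Conventional update: append (x_a, y_a) with a fixed weight beta.

    The weights of the existing support vectors are left untouched,
    which is exactly the limitation discussed in the text.
    """
    return support + [(x_a, y_a, beta)]
```

Calling `single_update` and then `predict_score` evaluates $f'(x) = \beta y_a \kappa(x, x_a) + f(x)$ without revisiting any earlier weight.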


The shortcoming of the conventional online learning approach is that the introduction of the new
support vector $(x_a, y_a)$ may harm the classification of existing support vectors in $D$, which is
revealed by the following proposition.


Proposition 1 Let $(x_a, y_a)$ be an example misclassified by the current classifier $f(x) =
\sum_{i=1}^{n} \alpha_i y_i \kappa(x, x_i)$ with $\alpha_i > 0$, $i = 1,\ldots,n$, that is, $y_a f(x_a) < 0$. Let $f'(x) = \beta y_a \kappa(x, x_a) + f(x)$
be the updated classifier with $\beta > 0$. There exists at least one support vector $x_i \in D$ such that

$$y_i f(x_i) > y_i f'(x_i).$$

Proof It follows from the fact that $\exists\, x_i \in D$ with $y_i y_a \kappa(x_i, x_a) < 0$ when $y_a f(x_a) < 0$. ∎


As indicated by Proposition 1, when a misclassified example $(x_a, y_a)$ is added to the classifier, the
classification confidence of at least one existing support vector will be reduced. When $y_a f(x_a) \le -\gamma$,
there exists one support vector $(x_b, y_b) \in D$ that satisfies $\beta y_a y_b \kappa(x_a, x_b) \le -\beta\gamma/n$. This support
vector will be misclassified by the updated classifier $f'(x)$ if $y_b f(x_b) \le \beta\gamma/n$. In order to alleviate
this problem, we propose to update the weight for the existing support vector whose classification
confidence is significantly affected by the new misclassified example. In particular, we consider a
support vector $(x_b, y_b) \in D$ for weight updating if it satisfies the following two conditions:


• $y_b f(x_b) \le 0$, that is, support vector $(x_b, y_b)$ is misclassified by the current classifier $f(x)$;


• $\kappa(x_b, x_a) y_a y_b \le -\rho$ where $\rho \in (0,1)$ is a predetermined threshold, that is, support vector
$(x_b, y_b)$ “conflicts” with the new misclassified example $(x_a, y_a)$.


We refer to the support vector satisfying the above conditions as an auxiliary example. It is clear
that by adding the misclassified example $(x_a, y_a)$ to classifier $f(x)$ with weight $\beta$, the classification
score of $(x_b, y_b)$ will be reduced by at least $\beta\rho$, which could lead to a significant misclassification of
the auxiliary example $(x_b, y_b)$. To avoid such a mistake, we propose to update the weights for both
$(x_a, y_a)$ and $(x_b, y_b)$ simultaneously. In the next section, we show the details of the double updating
algorithm for online learning, and the analysis of its mistake bound.
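The two conditions for an auxiliary example translate into a single scan over the support set. The sketch below is illustrative only: the function name and argument layout are hypothetical, and choosing the most conflicting candidate (smallest $y_b y_a \kappa(x_b, x_a)$) is one natural tie-breaking rule assumed here.

```python
def find_auxiliary(scores, labels, kernel_col, y_a, rho):
    """Return the index b of an auxiliary example, or None if there is none.

    scores[i]     = y_i * f(x_i), the signed score of support vector i
    labels[i]     = y_i in {-1, +1}
    kernel_col[i] = kappa(x_i, x_a) for the new misclassified example x_a
    """
    best, w_min = None, float("inf")
    for i, (score, y_i, k_ia) in enumerate(zip(scores, labels, kernel_col)):
        if score <= 0:                 # condition 1: misclassified by current f
            w = y_i * y_a * k_ia       # conflict measure w_ab
            if w < w_min:
                best, w_min = i, w
    # condition 2: the conflict must be at least as strong as -rho
    return best if w_min <= -rho else None
```

A smaller `rho` admits more candidates, so more double updates are triggered at the cost of extra computation per trial.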

Our analysis closely follows the previous work on the relationship between online learning and
the dual formulation of SVM (Shalev-Shwartz and Singer, 2006, 2007), in which online learning
is interpreted as an efficient updating rule for maximizing the objective function in the dual form
of SVM. We denote by $\Delta_t$ the improvement of the objective function in dual SVM when adding
a misclassified example to the classification function at the $t$-th trial. According to Theorem 1 in
Shalev-Shwartz and Singer (2006), if an online learning algorithm $\mathcal{A}$ is designed to ensure that for
all $t$, $\Delta_t$ is bounded from below by a bounding constant $\Delta$, then the number of mistakes made by
$\mathcal{A}$ when trained over a sequence of trials $(x_1,y_1),\ldots,(x_T,y_T)$, denoted by $M$, is upper bounded by

$$M \le \frac{1}{\Delta}\left(\min_{f \in \mathcal{H}_\kappa} \frac{1}{2}\|f\|_{\mathcal{H}_\kappa}^2 + C\sum_{i=1}^{T} \ell(y_i f(x_i))\right),$$

where $\ell(y_i f(x_i)) = \max(0, 1 - y_i f(x_i))$ is the hinge loss function. According to Shalev-Shwartz and
Singer (2006, 2007), the bounding constant $\Delta = 1/2$ when we only update the classifier with the
newly misclassified example. In our analysis, we will show that $\Delta$ can be significantly improved
when updating the weights for both the misclassified example and the auxiliary example.

For the remaining part of this section, we denote by $(x_b, y_b)$ an auxiliary example that satisfies
the two conditions specified before. We define

$$k_a = \kappa(x_a, x_a), \quad k_b = \kappa(x_b, x_b), \quad k_{ab} = \kappa(x_a, x_b), \quad w_{ab} = y_a y_b k_{ab}.$$

According to the assumption of auxiliary example, we have $w_{ab} = k_{ab} y_a y_b \le -\rho$. Finally, we denote
by $\hat\gamma_b$ the weight for the auxiliary example $(x_b, y_b)$ that is used in the current classifier $f(x)$, by $\gamma_a$
and $\gamma_b$ the updated weights for $(x_a, y_a)$ and $(x_b, y_b)$, respectively, and by $d_{\gamma_b}$ the difference $\gamma_b - \hat\gamma_b$.


3.3 Double Updating Online Learning for Binary Classification 


Recall that an auxiliary example $(x_b, y_b)$ should satisfy two conditions: (I) $y_b f(x_b) \le 0$, and (II) $w_{ab} \le -\rho$.
In addition, the example $(x_a, y_a)$ received in the current iteration $t$ is misclassified, that is, $y_a f(x_a) \le
0$. Following the framework of dual formulation for online learning, the following lemma shows
how to compute $\Delta_t$, that is, the improvement in the objective function of dual SVM by adjusting
weights for $(x_a, y_a)$ and $(x_b, y_b)$.


Lemma 1 The maximal improvement in the objective function of dual SVM by adjusting weights
for $(x_a, y_a)$ and $(x_b, y_b)$, denoted by $\Delta$, is computed by solving the following optimization problem
(which is a special case of the optimization problem (28) in Shalev-Shwartz and Singer, 2006):

$$\Delta = \max_{\gamma_a, d_{\gamma_b}} \left\{ h(\gamma_a, d_{\gamma_b}) : 0 \le \gamma_a \le C,\ -\hat\gamma_b \le d_{\gamma_b} \le C - \hat\gamma_b \right\} \qquad (1)$$

where

$$h(\gamma_a, d_{\gamma_b}) = \gamma_a(1 - y_a f(x_a)) + d_{\gamma_b}(1 - y_b f(x_b)) - \frac{k_a}{2}\gamma_a^2 - \frac{k_b}{2} d_{\gamma_b}^2 - w_{ab}\gamma_a d_{\gamma_b}.$$

The lemma follows directly from the dual formulation of SVM. The theorem below bounds the bounding
constant $\Delta$ when $C$ is sufficiently large.


Theorem 1 Assume $C \ge \hat\gamma_b + 1/(1-\rho)$ with $\rho \in [0,1)$ for the selected auxiliary example $(x_b, y_b)$.
Then we have the following bound for the bounding constant $\Delta$:

$$\Delta \ge \frac{1}{1-\rho}.$$

Proof First, we show $d_{\gamma_b} \ge 0$. This is because for given $\gamma_a \ge 0$, the optimal solution for $d_{\gamma_b}$, given
by

$$d_{\gamma_b} = \frac{1 - y_b f(x_b) - w_{ab}\gamma_a}{k_b},$$

is positive because $y_b f(x_b) \le 0$ and $w_{ab} \le -\rho$. Using the fact $k_a, k_b \le 1$, $\gamma_a, d_{\gamma_b} \ge 0$, $y_a f(x_a) \le 0$,
$y_b f(x_b) \le 0$, and $w_{ab} \le -\rho$, we have

$$h(\gamma_a, d_{\gamma_b}) \ge \gamma_a + d_{\gamma_b} - \frac{1}{2}\gamma_a^2 - \frac{1}{2} d_{\gamma_b}^2 + \rho\gamma_a d_{\gamma_b}.$$

Thus, $\Delta$ is bounded as

$$\Delta \ge \max_{\gamma_a \in [0,C],\ d_{\gamma_b} \in [0, C-\hat\gamma_b]} \gamma_a + d_{\gamma_b} - \frac{1}{2}\left(\gamma_a^2 + d_{\gamma_b}^2\right) + \rho\gamma_a d_{\gamma_b}.$$

Under the condition that $C \ge \hat\gamma_b + 1/(1-\rho)$, it is easy to verify that the optimal solution for the
above problem is $\gamma_a = d_{\gamma_b} = 1/(1-\rho)$, which leads to the result in the theorem. ∎
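As a quick check of this last step (a worked substitution added for illustration, using only the lower bound stated in the proof), setting $\gamma_a = d_{\gamma_b} = \frac{1}{1-\rho}$ in $\gamma_a + d_{\gamma_b} - \frac{1}{2}(\gamma_a^2 + d_{\gamma_b}^2) + \rho\gamma_a d_{\gamma_b}$ gives

$$\frac{2}{1-\rho} - \frac{1}{(1-\rho)^2} + \frac{\rho}{(1-\rho)^2} = \frac{2}{1-\rho} - \frac{1-\rho}{(1-\rho)^2} = \frac{1}{1-\rho}.$$

The stationarity condition $1 - \gamma_a + \rho\, d_{\gamma_b} = 0$ also holds at this point, since $(1-\rho)\cdot\frac{1}{1-\rho} = 1$. For instance, $\rho = 0.5$ yields a bounding constant of at least $2$, compared with $1/2$ for a single update.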


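We refer to the case as a strong double update when the condition of Theorem 1 is satisfied. We
have the following theorem for the general case when we only have $C \ge 1$.


Theorem 2 Assume $C \ge 1$. We have the following bound for $\Delta$ when updating the weight for the
misclassified example $(x_a, y_a)$ and the auxiliary example $(x_b, y_b)$:

$$\Delta \ge \frac{1}{2} + \frac{1}{2}\min\left((1+\rho)^2, (C - \hat\gamma_b)^2\right).$$

Proof By setting $\gamma_a = 1$, we have $h(\gamma_a, d_{\gamma_b})$ computed as

$$h(\gamma_a = 1, d_{\gamma_b}) \ge \frac{1}{2} + (1+\rho) d_{\gamma_b} - \frac{1}{2} d_{\gamma_b}^2.$$

Hence, $\Delta$ is lower bounded by

$$\Delta \ge \frac{1}{2} + \max_{d_{\gamma_b} \in [0, C-\hat\gamma_b]} \left((1+\rho) d_{\gamma_b} - \frac{1}{2} d_{\gamma_b}^2\right) \ge \frac{1}{2} + \frac{1}{2}\min\left((1+\rho)^2, (C - \hat\gamma_b)^2\right). \qquad ∎$$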


Although Theorems 1 and 2 show that the double update strategy could significantly improve
the bounding constant $\Delta$ over $1/2$ and consequently reduce the mistake bound, it is applicable
only when there exists an auxiliary example. Below, we extend the double update strategy to the
cases when there is no auxiliary example. Specifically, we relax the condition for performing double
update as follows: there exists $(x_b, y_b) \in D$ such that (i) $w_{ab} \le -\rho$, (ii) $y_b f_{t-1}(x_b) \le 1$, and (iii) $C \ge \hat\gamma_b + \rho$.
We refer to these cases as weak double update.


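Theorem 3 Assume $w_{ab} \le -\rho$, $y_b f_{t-1}(x_b) \le 1$ and $C \ge \hat\gamma_b + \rho$. Then we have the following bound for the
bounding constant:

$$\Delta = \max_{\gamma_a, d_{\gamma_b}} h(\gamma_a, d_{\gamma_b}) \ge h(1, \rho) \ge \frac{1}{2} + \frac{1}{2}\rho^2.$$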
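The bound can be verified by direct substitution (a worked step added for illustration). With $h(\gamma_a, d_{\gamma_b}) = \gamma_a(1 - y_a f(x_a)) + d_{\gamma_b}(1 - y_b f(x_b)) - \frac{k_a}{2}\gamma_a^2 - \frac{k_b}{2}d_{\gamma_b}^2 - w_{ab}\gamma_a d_{\gamma_b}$ as in Lemma 1, and using $1 - y_a f(x_a) \ge 1$ (the current example is misclassified), $1 - y_b f(x_b) \ge 0$, $k_a, k_b \le 1$ and $w_{ab} \le -\rho$:

$$h(1, \rho) = (1 - y_a f(x_a)) + \rho\,(1 - y_b f(x_b)) - \frac{k_a}{2} - \frac{k_b}{2}\rho^2 - w_{ab}\rho \;\ge\; 1 + 0 - \frac{1}{2} - \frac{1}{2}\rho^2 + \rho^2 = \frac{1}{2} + \frac{1}{2}\rho^2.$$

The point $(1, \rho)$ is feasible because $C \ge 1$ and $\rho \le C - \hat\gamma_b$.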


Algorithm 1 The Double Updating Online Learning Algorithm (DUOL)
PROCEDURE
  Initialize $S_0 = \emptyset$, $f_0 = 0$;
  for $t = 1, 2, \ldots, T$ do
    Receive a new instance $x_t$;
    Predict $\hat{y}_t = \mathrm{sign}(f_{t-1}(x_t))$;
    Receive its label $y_t$;
    $\ell_t = \max\{0, 1 - y_t f_{t-1}(x_t)\}$
    if $\ell_t > 0$ then
      $w_{\min} = \infty$;
      for $\forall i \in S_{t-1}$ do
        if ($f^i_{t-1} \le 1$) then
          if ($y_i y_t \kappa(x_i, x_t) \le w_{\min}$) then
            $w_{\min} = y_i y_t \kappa(x_i, x_t)$;
            $(x_b, y_b) = (x_i, y_i)$;
          end if
        end if
      end for
      $f^t_{t-1} = y_t f_{t-1}(x_t)$;
      $S_t = S_{t-1} \cup \{t\}$;
      if ($w_{\min} \le -\rho$) then
        Compute $\gamma_t$ and $\gamma_b$ by solving the optimization (1)
        for $\forall i \in S_t$ do
          $f^i_t \leftarrow f^i_{t-1} + y_i \gamma_t y_t \kappa(x_i, x_t) + y_i(\gamma_b - \hat\gamma_b) y_b \kappa(x_i, x_b)$;
        end for
        $f_t = f_{t-1} + \gamma_t y_t \kappa(x_t, \cdot) + (\gamma_b - \hat\gamma_b) y_b \kappa(x_b, \cdot)$;
      else /* no auxiliary example found */
        $\gamma_t = \min(C, \ell_t / \kappa(x_t, x_t))$;
        for $\forall i \in S_t$ do
          $f^i_t \leftarrow f^i_{t-1} + y_i \gamma_t y_t \kappa(x_i, x_t)$;
        end for
        $f_t = f_{t-1} + \gamma_t y_t \kappa(x_t, \cdot)$;
      end if
    else
      $f_t = f_{t-1}$; $S_t = S_{t-1}$;
      for $\forall i \in S_t$ do
        $f^i_t \leftarrow f^i_{t-1}$;
      end for
    end if
  end for
  return $f_T$, $S_T$
END

Figure 1: The Algorithms of Double Updating Online Learning (DUOL).


Solving the optimization problem (1) is the key to the double update. The following proposition 
provides the optimal solution to the problem (1). 


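Proposition 2 Denote $\ell_a := 1 - y_a f(x_a)$ and $\ell_b := 1 - y_b f(x_b)$. Assume $\ell_a, \ell_b \ge 0$, $k_a, k_b > 0$ and
$w_{ab} < 0$. Then the solution of optimization problem (1) is as follows:

$$(\gamma_a, d_{\gamma_b}) = \begin{cases}
(C,\ C - \hat\gamma_b) & \text{if } k_a C + w_{ab}(C - \hat\gamma_b) - \ell_a < 0 \text{ and } k_b(C - \hat\gamma_b) + w_{ab} C - \ell_b < 0, \\[4pt]
\left(C,\ \dfrac{\ell_b - w_{ab} C}{k_b}\right) & \text{if } w_{ab}^2 C - w_{ab}\ell_b - k_a k_b C + k_b \ell_a > 0 \text{ and } \dfrac{\ell_b - w_{ab} C}{k_b} \in [-\hat\gamma_b,\ C - \hat\gamma_b], \\[4pt]
\left(\dfrac{\ell_a - w_{ab}(C - \hat\gamma_b)}{k_a},\ C - \hat\gamma_b\right) & \text{if } \dfrac{\ell_a - w_{ab}(C - \hat\gamma_b)}{k_a} \in [0, C] \text{ and } \ell_b - k_b(C - \hat\gamma_b) - w_{ab}\dfrac{\ell_a - w_{ab}(C - \hat\gamma_b)}{k_a} > 0, \\[4pt]
\left(\dfrac{k_b \ell_a - w_{ab}\ell_b}{k_a k_b - w_{ab}^2},\ \dfrac{k_a \ell_b - w_{ab}\ell_a}{k_a k_b - w_{ab}^2}\right) & \text{if } \left(\dfrac{k_b \ell_a - w_{ab}\ell_b}{k_a k_b - w_{ab}^2},\ \dfrac{k_a \ell_b - w_{ab}\ell_a}{k_a k_b - w_{ab}^2}\right) \in [0, C] \times [-\hat\gamma_b,\ C - \hat\gamma_b].
\end{cases}$$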
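The four cases of Proposition 2 can be coded directly. The following is a minimal sketch under the proposition's assumptions, not the released DUOL code: the function names are hypothetical, and returning `None` when no case applies (including the degenerate case $k_a k_b = w_{ab}^2$) is a choice made here.

```python
def h(g, d, l_a, l_b, k_a, k_b, w_ab):
    """Objective of problem (1) with l_a = 1 - y_a f(x_a), l_b = 1 - y_b f(x_b)."""
    return g * l_a + d * l_b - 0.5 * k_a * g * g - 0.5 * k_b * d * d - w_ab * g * d

def solve_double_update(l_a, l_b, k_a, k_b, w_ab, C, gamma_hat_b):
    """Return (gamma_a, d_gamma_b) following the case analysis of Proposition 2."""
    lo, hi = -gamma_hat_b, C - gamma_hat_b     # box constraint on d_gamma_b
    # Case 1: both variables sit at their upper bounds.
    if k_a * C + w_ab * hi - l_a < 0 and k_b * hi + w_ab * C - l_b < 0:
        return C, hi
    # Case 2: gamma_a = C, d_gamma_b maximizes h along that edge.
    d = (l_b - w_ab * C) / k_b
    if w_ab ** 2 * C - w_ab * l_b - k_a * k_b * C + k_b * l_a > 0 and lo <= d <= hi:
        return C, d
    # Case 3: d_gamma_b at its upper bound, gamma_a maximizes h along that edge.
    g = (l_a - w_ab * hi) / k_a
    if 0 <= g <= C and l_b - k_b * hi - w_ab * g > 0:
        return g, hi
    # Case 4: unconstrained stationary point lies inside the box.
    det = k_a * k_b - w_ab ** 2
    if det > 0:
        g = (k_b * l_a - w_ab * l_b) / det
        d = (k_a * l_b - w_ab * l_a) / det
        if 0 <= g <= C and lo <= d <= hi:
            return g, d
    return None  # no listed case applies
```

Because $h$ is a concave quadratic whenever $k_a k_b > w_{ab}^2$, the interior case can be cross-checked numerically against a coarse grid search over the feasible box.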


The detailed proof for Proposition 2 can be found in Appendix A. Figure 1 summarizes the proposed
Double Updating Online Learning (DUOL) algorithm. In this algorithm, to efficiently find the
auxiliary example $(x_b, y_b)$, we introduce a variable $f^i_t$ for each support vector to keep track of its
classification score. Parameter $\rho$ is used to trade off between efficiency and efficacy for DUOL: the
smaller $\rho$ is, the more double updates will be performed.

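Finally, we give the mistake bound for the DUOL algorithm. We denote by $\mathcal{M}$ the set of indexes
that correspond to the trials of misclassification, that is,

$$\mathcal{M} = \{t \mid y_t \ne \mathrm{sign}(f_{t-1}(x_t)),\ \forall t \in [T]\}.$$

In addition, we denote by $\mathcal{M}_d^s(\rho)$ and $\mathcal{M}_d^w(\rho)$ the sets of indexes for the cases of strong and weak
double updating, respectively, that is,

$$\mathcal{M}_d^s(\rho) = \left\{t \;\middle|\; \exists \text{ auxiliary example } (x_b, y_b) \text{ s.t. } C \ge \hat\gamma_b + \frac{1}{1-\rho} \text{ for } (x_t, y_t),\ t \in \mathcal{M}\right\},$$

$$\mathcal{M}_d^w(\rho) = \left\{t \;\middle|\; \exists\, (x_b, y_b) \text{ s.t. } w_{ab} \le -\rho,\ y_b f_{t-1}(x_b) \le 1 \text{ and } C \ge \hat\gamma_b + \rho,\ t \in \mathcal{M} \setminus \mathcal{M}_d^s(\rho)\right\}.$$

Note that in set $\mathcal{M}_d^s(\rho)$, for the convenience of analysis, we only consider the subset of strong
updates when the condition $C \ge \hat\gamma_b + 1/(1-\rho)$ is satisfied. Finally, we denote the cardinalities of
sets $\mathcal{M}$, $\mathcal{M}_d^s(\rho)$ and $\mathcal{M}_d^w(\rho)$ by $M = |\mathcal{M}|$, $M_d^s(\rho) = |\mathcal{M}_d^s(\rho)|$ and $M_d^w(\rho) = |\mathcal{M}_d^w(\rho)|$, and define
$M_s = M - M_d^s(\rho) - M_d^w(\rho)$.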


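Theorem 4 Let $(x_1,y_1),\ldots,(x_T,y_T)$ be a sequence of examples, where $x_t \in \mathbb{R}^d$, $y_t \in \{-1,+1\}$ and
$\kappa(x_t, x_t) \le 1$ for all $t$, and assume $C \ge 1$. Then for any function $f$ in $\mathcal{H}_\kappa$, the number of prediction
mistakes $M$ made by DUOL on this sequence of examples is bounded by:

$$M \le 2\left(\min_{f \in \mathcal{H}_\kappa} \frac{1}{2}\|f\|_{\mathcal{H}_\kappa}^2 + C\sum_{i=1}^{T}\ell(y_i f(x_i))\right) - \rho^2 M_d^w(\rho) - \frac{1+\rho}{1-\rho} M_d^s(\rho),$$

where $\rho \in [0,1)$.

Proof According to Theorems 1 and 3, we have

$$\min_{t \in \mathcal{M}_d^s(\rho)} \Delta_t \ge \frac{1}{1-\rho}, \qquad \min_{t \in \mathcal{M}_d^w(\rho)} \Delta_t \ge \frac{1+\rho^2}{2}.$$

Moreover, according to Theorem 2, we have $\Delta_t \ge 1/2$, $\forall t \in \mathcal{M}$. Putting them together, we have

$$\frac{1}{2} M_s + \frac{1+\rho^2}{2} M_d^w(\rho) + \frac{1}{1-\rho} M_d^s(\rho) \le \min_{f \in \mathcal{H}_\kappa} \frac{1}{2}\|f\|_{\mathcal{H}_\kappa}^2 + C\sum_{i=1}^{T}\ell(y_i f(x_i)).$$

We complete the proof using $M = M_s + M_d^w(\rho) + M_d^s(\rho)$. ∎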
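To make the final step explicit (a short derivation added for clarity; $B$ is a shorthand introduced here for $\min_{f \in \mathcal{H}_\kappa} \frac{1}{2}\|f\|_{\mathcal{H}_\kappa}^2 + C\sum_{i=1}^{T}\ell(y_i f(x_i))$), substitute $M_s = M - M_d^w(\rho) - M_d^s(\rho)$ into $\frac{1}{2}M_s + \frac{1+\rho^2}{2}M_d^w(\rho) + \frac{1}{1-\rho}M_d^s(\rho) \le B$:

$$\frac{M}{2} + \frac{\rho^2}{2} M_d^w(\rho) + \left(\frac{1}{1-\rho} - \frac{1}{2}\right) M_d^s(\rho) \le B.$$

Since $\frac{1}{1-\rho} - \frac{1}{2} = \frac{1+\rho}{2(1-\rho)}$, multiplying by 2 and rearranging gives

$$M \le 2B - \rho^2 M_d^w(\rho) - \frac{1+\rho}{1-\rho} M_d^s(\rho).$$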


As revealed by the above theorem, the mistake bound of the proposed double updating
online learning algorithm is smaller than that of an online learning algorithm that only performs
a single update in each trial. The difference in the mistake bound is essentially due to the double
updating, that is, the more double updates are performed, the more advantageous the proposed
algorithm will be. Besides, the above bound also indicates that a strong double update is more powerful
than a weak double update, given that the associated weight of a strong double update, $(1+\rho)/(1-\rho)$,
is always much larger than that of a weak double update, $\rho^2/2$. It is worthwhile pointing out that,
according to Theorem 4, it may seem that the larger the value of $\rho$, the smaller the mistake bound
will be. This however may not be true because the number of double updates in general decreases as $\rho$ increases. Finally,
we note that Theorem 4 bounds the number of mistakes made by the proposed DUOL algorithm
for $C \ge 1$. When $C < 1$, the mistake bound for the proposed algorithm follows Theorem 2, 3 and
Corollary 2 in Shalev-Shwartz and Singer (2007).


4. Multiclass Double Updating Online Learning


In this section, we extend the proposed double updating online learning algorithm to multiclass 
learning where each instance can be assigned to multiple classes. 


4.1 Online Multiclass Learning 


Similar to online binary classification, online multiclass learning is performed over a sequence
of training examples $(x_1,Y_1),\ldots,(x_T,Y_T)$. Unlike binary classification where $y_t \in \{-1,+1\}$, in
multiclass learning, each class assignment $Y_t \subseteq \mathcal{Y} = \{1,\ldots,k\}$ could contain multiple class labels,
making it a more challenging problem. We use $\hat{Y}_t$ to represent the class set predicted by the online
learning algorithm. Before presenting our algorithm, we first review online multiclass learning
(Crammer and Singer, 2003; Fink et al., 2006) based on the framework of label ranking (Crammer
and Singer, 2005).


4.1.1 LABEL RANKING FOR MULTICLASS LEARNING 


Given an instance $x$, the label ranking approach first computes a score for every class label in $\mathcal{Y}$,
and ranks the classes in descending order of their scores. The predicted class set $\hat{Y}_t$ is formed by
the classes with the highest scores. The objective of label ranking is to ensure that the score of class
$r$ is significantly larger than that of class $s$ if $r \in Y_t$ is a true class assignment while $s \in \mathcal{Y} \setminus Y_t$ is not.
An instance $x$ is classified incorrectly if the above condition is not satisfied.

We follow the protocol of multi-prototype (Vapnik, 1998; Crammer and Singer, 2001; Crammer
et al., 2006) for the design of the multiclass multilabel learning algorithm. It learns multiple
hypotheses/classifiers, one classifier for each class in $\mathcal{Y}$, leading to a total of $k$ classifiers that are trained for
the classification task. Specifically, for trial $t$, upon receiving an instance $x_t$, the scores of $k$ classes
output by the set of $k$ hypotheses are given by

$$f_{t-1}(x_t) = \left(f_{t-1,1}(x_t), \ldots, f_{t-1,k}(x_t)\right)^\top,$$

where $f_{t-1,i} \in \mathcal{H}_\kappa$, $i = 1,\ldots,k$. We introduce two variables $r_t$ and $s_t$ that are defined as follows:

$$r_t = \arg\min_{r \in Y_t} f_{t-1,r}(x_t) \quad \text{and} \quad s_t = \arg\max_{s \notin Y_t} f_{t-1,s}(x_t). \qquad (2)$$

Here, $r_t$ and $s_t$ represent the class of the smallest score among all relevant classes and the class of the
largest score among the irrelevant classes, respectively. Using the notation of $r_t$ and $s_t$, the margin
with respect to the hypothesis set $f_{t-1}$ at trial $t$ is defined as follows:

$$\Gamma(f_{t-1}; (x_t, Y_t)) = f_{t-1,r_t}(x_t) - f_{t-1,s_t}(x_t).$$

Based on the notion of classification margin, we define the loss function of hypotheses $f_{t-1}(x)$ for
training example $(x_t, Y_t)$ as follows:

$$\ell(f_{t-1}; (x_t, Y_t)) = \max_{r \in Y_t,\, s \notin Y_t} \left[1 - \left(f_{t-1,r}(x_t) - f_{t-1,s}(x_t)\right)\right]_+,$$

where $[x]_+ = \max(0, x)$.
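The quantities $r_t$, $s_t$, the margin and the loss can be computed together from the score vector. The sketch below is illustrative: the function name is hypothetical, classes are indexed from 0 rather than 1, and the relevant and irrelevant sets are both assumed to be non-empty.

```python
def ranking_loss(scores, relevant):
    """Return (r, s, margin, loss) for one example.

    scores   : list of k class scores f_{t-1,i}(x_t), 0-based class index
    relevant : set of true class indices Y_t
    """
    relevant_sorted = sorted(relevant)
    irrelevant = [c for c in range(len(scores)) if c not in relevant]
    r = min(relevant_sorted, key=lambda c: scores[c])  # weakest relevant class
    s = max(irrelevant, key=lambda c: scores[c])       # strongest irrelevant class
    margin = scores[r] - scores[s]
    return r, s, margin, max(0.0, 1.0 - margin)
```

Taking the minimum over relevant classes and the maximum over irrelevant ones gives the same value as maximizing $[1 - (f_{t-1,r}(x_t) - f_{t-1,s}(x_t))]_+$ over all pairs, since that pair has the smallest score difference.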


4.1.2 A PERCEPTRON ALGORITHM FOR ONLINE MULTICLASS LEARNING


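According to Crammer and Singer (2003), when an example is misclassified at trial $t$, we update
each component of the classifier $f_{t-1}$ as follows:

$$f_{t,i}(x) = f_{t-1,i}(x) + \gamma_t\, \sigma_{Y_t}(i,t)\, \kappa(x_t, x), \quad \forall i \in \mathcal{Y}, \qquad (3)$$

where $\gamma_t \in (0,C]$, and function $\sigma_{Y_t}(i,t)$, which is simplified as $\sigma(i,t)$, is defined below:

$$\sigma(i,t) = \begin{cases} 1 & \text{if } i = r_t, \\ -1 & \text{if } i = s_t, \\ 0 & \text{otherwise.} \end{cases}$$

Using notation $H(Y_t) = (\sigma(1,t), \ldots, \sigma(k,t))^\top$, we rewrite Equation (3) as $f_t(x) = f_{t-1}(x) +
\gamma_t H(Y_t)\kappa(x_t, x)$, or equivalently

$$f_t(x) = \sum_{i=1}^{n} \gamma_i H(Y_i)\kappa(x_i, x),$$

where $n$ is the number of support vectors received so far.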
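A small sketch of the vector $H(Y_t)$ and the update in Equation (3) follows. The names and the tuple layout of the support set are assumptions made here for illustration, and classes are indexed from 0.

```python
def h_vector(k, r, s):
    """H(Y_t): length-k list with +1 at class r, -1 at class s, 0 elsewhere."""
    h = [0] * k
    h[r] = 1
    h[s] = -1
    return h

def class_scores(x, support, kernel, k):
    """f(x) = sum_i gamma_i * H(Y_i) * kappa(x_i, x); support holds (x_i, H_i, gamma_i)."""
    scores = [0.0] * k
    for x_i, h_i, gamma_i in support:
        kv = kernel(x_i, x)
        for c in range(k):
            scores[c] += gamma_i * h_i[c] * kv
    return scores

def perceptron_update(support, x_t, r_t, s_t, gamma_t, k):
    """Equation (3): add the misclassified example with weight gamma_t."""
    return support + [(x_t, h_vector(k, r_t, s_t), gamma_t)]
```

Only the components $r_t$ and $s_t$ of the classifier change in each update, so storing $H(Y_i)$ per support vector is enough to recover all $k$ score functions.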


4.2 Multiclass DUOL Algorithm 


We extend the DUOL algorithm to multiclass learning. We denote by $(x_a, Y_a)$ the misclassified
example received at the current trial, that is, $(f(x_a))_{r_a} - (f(x_a))_{s_a} \le 0$. Similar to DUOL for binary
classification, we introduce an auxiliary example $(x_b, Y_b)$ from the existing support vectors that obeys
the following conditions:


1. $(f(x_b))_{r_b} - (f(x_b))_{s_b} \le 0$, that is, $(x_b, Y_b)$ is misclassified by the current classifier $f$;


2. $(H(Y_a) \cdot H(Y_b))\,\kappa(x_a, x_b) \le -2\rho$ where $\rho \in (0,1)$ is a threshold. This property indicates that
example $(x_a, Y_a)$ conflicts with example $(x_b, Y_b)$.


Compared to the auxiliary example defined for binary classification, we introduce $H(Y_a) \cdot H(Y_b)$
above when defining two conflicting instances. Given $\kappa(x_a, x_b) \ge 0$, the second condition of the
auxiliary example implies $H(Y_a) \cdot H(Y_b) < 0$, which further indicates that the two examples $(x_a, Y_a)$ and
$(x_b, Y_b)$ have opposite predictions, that is, $(r_a = s_b)$ or $(s_a = r_b)$. This result is revealed by the
following proposition.


Proposition 3 The inequality $H(Y_a) \cdot H(Y_b) < 0$ holds if and only if $(r_a = s_b)$ or $(s_a = r_b)$.


The proof of Proposition 3 is given in the appendix. 
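Independently of the appendix proof, the statement can be checked by exhaustive enumeration for small $k$. This is a sanity check sketched here, not a substitute for the proof; the function names are hypothetical and classes are indexed from 0.

```python
from itertools import permutations

def h_vector(k, r, s):
    """H(Y): +1 at class r, -1 at class s, 0 elsewhere."""
    h = [0] * k
    h[r] = 1
    h[s] = -1
    return h

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def proposition3_holds(k):
    """Check H(Y_a).H(Y_b) < 0  <=>  (r_a == s_b or s_a == r_b) for all pairs."""
    pairs = list(permutations(range(k), 2))   # every (r, s) with r != s
    for r_a, s_a in pairs:
        for r_b, s_b in pairs:
            negative = dot(h_vector(k, r_a, s_a), h_vector(k, r_b, s_b)) < 0
            opposite = (r_a == s_b) or (s_a == r_b)
            if negative != opposite:
                return False
    return True
```

The check works because the inner product expands to $[r_a = r_b] + [s_a = s_b] - [r_a = s_b] - [s_a = r_b]$, and the positive and negative indicator terms cannot both be active for the same pair.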

Similar to the DUOL algorithm for binary classification, our analysis aims to show that by
updating weights for both the misclassified example and the auxiliary example, we may be able to
significantly improve the bounding constant $\Delta$, which is defined as follows:

$$M \le \frac{1}{\Delta}\left(\min_{f \in \mathcal{H}} F(f) + C\sum_{i=1}^{T} \ell(f; (x_i, Y_i))\right), \qquad (4)$$

where $\mathcal{H} = \prod_{i=1}^{k} \mathcal{H}_\kappa$ and $F(f) = \frac{1}{2}\sum_{i=1}^{k} \|f_i\|_{\mathcal{H}_\kappa}^2$. To ease our further discussions, we define
$k_a = \kappa(x_a, x_a)$, $k_b = \kappa(x_b, x_b)$, and $w_{ab} = (H(Y_a) \cdot H(Y_b))\,\kappa(x_a, x_b)$.

The following proposition shows the optimization problem related to the multiclass double
updating online learning algorithm, which forms the basis for deriving the bounding constant $\Delta$.
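

Proposition 4 With the double updating, that is, adjusting the weight of some auxiliary support
vector $(x_b, Y_b)$ from $\hat\gamma_b$ to $\gamma_b$ (denoted by $d_{\gamma_b} = \gamma_b - \hat\gamma_b$) and assigning weight $\gamma_a$ to the current
misclassified example $(x_a, Y_a)$, the improvement in the objective function of dual SVM, denoted by $\Delta_t$, is
computed by the following optimization problem:

$$\begin{aligned}
\max_{\gamma_a, d_{\gamma_b}} \quad & \gamma_a\left(1 - \left(f_{t-1,r_a}(x_a) - f_{t-1,s_a}(x_a)\right)\right) + d_{\gamma_b}\left(1 - \left(f_{t-1,r_b}(x_b) - f_{t-1,s_b}(x_b)\right)\right) \\
& - k_a\gamma_a^2 - k_b d_{\gamma_b}^2 - w_{ab}\gamma_a d_{\gamma_b} \\
\text{s.t.} \quad & 0 \le \gamma_a \le C, \quad -\hat\gamma_b \le d_{\gamma_b} \le C - \hat\gamma_b.
\end{aligned} \qquad (5)$$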


Theorem 5 Assume $\kappa(x,x) \le 1$ for any $x$ and $C \ge \hat\gamma_b + \frac{1}{2(1-\rho)}$ for the selected auxiliary example
$(x_b, Y_b)$. Then we have the following bound for $\Delta$:

$$\Delta \ge \frac{1}{2(1-\rho)}.$$

We refer to the case as a strong double update when there exists an auxiliary example $(x_b, Y_b)$ s.t.
$C \ge \hat\gamma_b + \frac{1}{2(1-\rho)}$. Similar to double updating for binary classification, we introduce weak double
update when there exists $(x_b, Y_b)$ s.t. $w_{ab} \le -2\rho$, $f_{t-1,r_b}(x_b) - f_{t-1,s_b}(x_b) \le 1$, and $C \ge \hat\gamma_b + \frac{\rho}{2}$.


Theorem 6 Assume there exists $(x_b, Y_b)$ s.t. $w_{ab} \le -2\rho$, $f_{t-1,r_b}(x_b) - f_{t-1,s_b}(x_b) \le 1$, $C \ge \hat\gamma_b + \frac{\rho}{2}$,
and the current instance is misclassified. Then we have the following bounding constant:

$$\Delta \ge \frac{1+\rho^2}{4}.$$
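One way to see where this constant comes from is to evaluate the objective of problem (5) at a single feasible point (a worked check added for illustration; the evaluation point $(\gamma_a, d_{\gamma_b}) = (1/2, \rho/2)$ is an assumption made here and is not stated in the text). Writing $\ell_a \ge 1$ for the loss of the misclassified example and $\ell_b \ge 0$ for that of $(x_b, Y_b)$, and using $k_a, k_b \le 1$ and $w_{ab} \le -2\rho$:

$$\frac{1}{2}\ell_a + \frac{\rho}{2}\ell_b - \frac{k_a}{4} - \frac{k_b\rho^2}{4} - w_{ab}\cdot\frac{1}{2}\cdot\frac{\rho}{2} \;\ge\; \frac{1}{2} + 0 - \frac{1}{4} - \frac{\rho^2}{4} + \frac{\rho^2}{2} = \frac{1+\rho^2}{4}.$$

The point is feasible because $\frac{1}{2} \le C$ and $\frac{\rho}{2} \le C - \hat\gamma_b$ under the stated condition on $C$.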
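The exact solution to the Quadratic Programming (QP) problem in (5) is given by the following
proposition.


Proposition 5 Denote $\ell_a := 1 - (f_{t-1,r_a}(x_a) - f_{t-1,s_a}(x_a))$ and $\ell_b := 1 - (f_{t-1,r_b}(x_b) - f_{t-1,s_b}(x_b))$.
Assume $\ell_a, \ell_b \ge 0$, $k_a, k_b > 0$ and $w_{ab} < 0$. Then the solution of optimization (5) is as follows:

$$(\gamma_a, d_{\gamma_b}) = \begin{cases}
(C,\ C - \hat\gamma_b) & \text{if } 2k_a C + w_{ab}(C - \hat\gamma_b) - \ell_a < 0 \text{ and } 2k_b(C - \hat\gamma_b) + w_{ab} C - \ell_b < 0, \\[4pt]
\left(C,\ \dfrac{\ell_b - w_{ab} C}{2k_b}\right) & \text{if } w_{ab}^2 C - w_{ab}\ell_b - 4k_a k_b C + 2k_b \ell_a > 0 \text{ and } \dfrac{\ell_b - w_{ab} C}{2k_b} \in [-\hat\gamma_b,\ C - \hat\gamma_b], \\[4pt]
\left(\dfrac{\ell_a - w_{ab}(C - \hat\gamma_b)}{2k_a},\ C - \hat\gamma_b\right) & \text{if } \dfrac{\ell_a - w_{ab}(C - \hat\gamma_b)}{2k_a} \in [0, C] \text{ and } \ell_b - 2k_b(C - \hat\gamma_b) - w_{ab}\dfrac{\ell_a - w_{ab}(C - \hat\gamma_b)}{2k_a} > 0, \\[4pt]
\left(\dfrac{2k_b \ell_a - w_{ab}\ell_b}{4k_a k_b - w_{ab}^2},\ \dfrac{2k_a \ell_b - w_{ab}\ell_a}{4k_a k_b - w_{ab}^2}\right) & \text{if } \left(\dfrac{2k_b \ell_a - w_{ab}\ell_b}{4k_a k_b - w_{ab}^2},\ \dfrac{2k_a \ell_b - w_{ab}\ell_a}{4k_a k_b - w_{ab}^2}\right) \in [0, C] \times [-\hat\gamma_b,\ C - \hat\gamma_b].
\end{cases}$$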


We skip the proof due to its high similarity to that of Proposition 2. Figure 2 summarizes the steps 
of the multiclass DUOL (M-DUOL) algorithm. Note that we replace the conditions for auxiliary 
example with the margin error in order to make more double updates. 

A mistake bound for the M-DUOL algorithm, similar to Theorem 4, is given by the following 
theorem. 


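Theorem 7 Let $(x_1,Y_1),\ldots,(x_T,Y_T)$ be a sequence of examples, where $x_t \in \mathbb{R}^d$, $Y_t \subseteq \mathcal{Y}$ and
$\kappa(x_i, x_j) \in [0,1]$ for all $i, j$, and assume $C \ge 1$. Then for any function $f \in \prod_{i=1}^{k} \mathcal{H}_\kappa$, the number
of prediction mistakes $M$ made by M-DUOL on this sequence of examples is bounded by:

$$M \le 4\left(\min_{f \in \mathcal{H}} F(f) + C\sum_{i=1}^{T}\ell(f; (x_i, Y_i))\right) - \rho^2 M_d^w(\rho) - \frac{1+\rho}{1-\rho} M_d^s(\rho).$$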


Algorithm 2: The Multiclass DUOL Algorithm (M-DUOL)
PROCEDURE
  Initialize $H_0 = \emptyset$, $S_0 = \emptyset$, $f_0 = 0$;
  for $t = 1, 2, \ldots, T$ do
    Receive a new instance $x_t$;
    Predict $W_{t-1} = f_{t-1}(x_t)$;
    Receive its label set $Y_t$;
    $\ell_t = [1 - W_{t-1} \cdot H(Y_t)]_+$
    if $\ell_t > 0$ then
      $w_{\min} = \infty$;
      for $\forall i \in S_{t-1}$ do
        if $f^i_{t-1} \le 1$ then
          if $(H(Y_i) \cdot H(Y_t))\,\kappa(x_i, x_t) \le w_{\min}$ then
            $w_{\min} = (H(Y_i) \cdot H(Y_t))\,\kappa(x_i, x_t)$;
            $(x_b, Y_b) = (x_i, Y_i)$;
          end if
        end if
      end for
      $f^t_{t-1} = W_{t-1} \cdot H(Y_t)$;
      $S_t = S_{t-1} \cup \{t\}$; $H_t = H_{t-1} \cup \{H(Y_t)\}$;
      if ($w_{\min} \le -2\rho$) then
        Compute $\gamma_t$ and $\gamma_b$ by solving the optimization (5)
        for $\forall i \in S_t$ do
          $f^i_t \leftarrow f^i_{t-1} + [\gamma_t H(Y_t)\kappa(x_t, x_i)] \cdot H(Y_i) + [(\gamma_b - \hat\gamma_b) H(Y_b)\kappa(x_b, x_i)] \cdot H(Y_i)$;
        end for
        $f_t = f_{t-1} + \gamma_t H(Y_t)\kappa(x_t, \cdot) + (\gamma_b - \hat\gamma_b) H(Y_b)\kappa(x_b, \cdot)$;
      else /* no auxiliary example found */
        $\gamma_t = \min(C, \ell_t / (2\kappa(x_t, x_t)))$;
        for $\forall i \in S_t$ do
          $f^i_t \leftarrow f^i_{t-1} + [\gamma_t H(Y_t)\kappa(x_t, x_i)] \cdot H(Y_i)$;
        end for
        $f_t = f_{t-1} + \gamma_t H(Y_t)\kappa(x_t, \cdot)$;
      end if
    else
      $f_t = f_{t-1}$; $S_t = S_{t-1}$; $H_t = H_{t-1}$;
      for $\forall i \in S_t$ do
        $f^i_t \leftarrow f^i_{t-1}$;
      end for
    end if
  end for
  return $f_T$, $S_T$, $H_T$
END

Figure 2: Algorithms of multiclass double-updating online learning (M-DUOL).


5. Experimental Results 


In this section, we evaluate the empirical performance of the proposed double updating online learn- 
ing algorithms for online learning tasks. We first evaluate the performance of DUOL for binary 
classification, followed by the evaluation of multiclass double updating online learning. 


5.1 Testbeds and Experimental Setup for Binary-class Online Learning 


We compare our technique with a number of state-of-the-art techniques, including the kernel
Perceptron algorithm (Kivinen et al., 2001), the “ROMMA” algorithm and its aggressive version
“agg-ROMMA” (Li and Long, 1999), the ALMA$_p(\alpha)$ algorithm (Gentile, 2001), and the
Passive-Aggressive algorithms (“PA”) (Crammer et al., 2006). For PA, two versions of algorithms
(PA-I and PA-II) are implemented as described in Crammer et al. (2006). Note that one may also 
compare with the online SVM algorithm (Shalev-Shwartz and Singer, 2006), which updates the 
weights for all support vectors in each trial. However, we do not include this baseline for compari- 
son because it is too computationally intensive to run on some large data sets. 


For the proposed DUOL algorithms, we implement three variants based on different solvers to
the problem in (1): (i) "DUOL_appr", which employs an approximate solution to (1), that is, ү, = 75
and y, = 4 + e; (ii) "DUOL", which uses the exact solution to (1) given in Proposition 2; and (iii)
"DUOL_iter", which first updates the weight for the misclassified example and then the weight for
the auxiliary example, as suggested in Shalev-Shwartz and Singer (2007).

We test all the algorithms on eight benchmark data sets from web machine learning repositories,
which are listed in Table 1. All of the data sets can be downloaded from the LIBSVM website,¹ the
UCI machine learning repository,² and the MIT CBCL face data sets.³











Data Set   | # examples | # features
sonar      |        208 |         60
splice     |      1,000 |         60
german     |      1,000 |         24
mushrooms  |      8,124 |        112
dorothea   |      1,150 |    100,000
spambase   |      4,601 |         57
MITFace    |      6,977 |        361
w7a        |     24,692 |        300

















Table 1: Binary-class data sets used in the experiments. 


To make a fair comparison, for all algorithms in comparison, we set C = 5 and use the same
Gaussian kernel with σ = 8. For the ALMA_p(α) algorithm, the parameters p and α are set to 2 and
0.9, respectively, based on our experience. For the proposed DUOL algorithm, we fix ρ to be 0 for
all cases. All the experiments are repeated 20 times, each with an independent random permutation
of the data points, and all the results are reported by averaging over the 20 runs. We evaluate the
online learning performance by measuring the mistake rate, that is, the percentage of examples that
are misclassified by the online learning algorithm. We measure the sparsity of the learned
classifiers by the number of support vectors. We evaluate the computational efficiency of all the
algorithms in terms of their CPU running time (in seconds). All the experiments are run in Matlab on
a Windows machine with a 2.3GHz CPU.
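
The evaluation protocol above can be sketched with the simplest baseline, the kernel Perceptron. This is a minimal Python illustration, not the Matlab code used in the experiments; the function names are hypothetical, and the Gaussian kernel is assumed to be parameterized as exp(-||x - z||^2 / (2σ^2)), which the text does not state explicitly.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=8.0):
    # Assumed parameterization of the Gaussian kernel.
    d = x - z
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

def run_kernel_perceptron(X, y, sigma=8.0):
    # One online pass; returns (mistake rate, number of support vectors).
    support, alpha, mistakes = [], [], 0
    for x_t, y_t in zip(X, y):
        score = sum(a * gaussian_kernel(x_s, x_t, sigma)
                    for a, x_s in zip(alpha, support))
        y_hat = 1.0 if score >= 0 else -1.0
        if y_hat != y_t:
            # A mistake adds the example as a support vector with weight y_t.
            mistakes += 1
            support.append(x_t)
            alpha.append(y_t)
    return mistakes / len(y), len(support)

def evaluate(X, y, runs=20, sigma=8.0, seed=0):
    # Average over independent random permutations of the data points.
    rng = np.random.default_rng(seed)
    rates, n_svs = [], []
    for _ in range(runs):
        perm = rng.permutation(len(y))
        rate, n_sv = run_kernel_perceptron(X[perm], y[perm], sigma)
        rates.append(100.0 * rate)
        n_svs.append(n_sv)
    return float(np.mean(rates)), float(np.std(rates)), float(np.mean(n_svs))
```

For the Perceptron, every mistake adds exactly one support vector, so the support vector count equals the mistake rate times the number of examples; the Perceptron rows of Table 2 follow this pattern.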


5.2 Performance Evaluation for Binary-Class Online Learning 


Table 2 summarizes the performance of all the compared online learning algorithms over the binary 
data sets. We can draw several observations from the results. 

First, among the six baseline algorithms in comparison, we observe that the agg-ROMMA and 
two PA algorithms (PA-I and PA-II) perform considerably better than the other three algorithms 
(i.e., Perceptron, ROMMA, and ALMA) in most cases. We also notice that the agg-ROMMA 
and the two PA algorithms consume considerably larger numbers of support vectors than the other 
three algorithms. We believe this is because the agg-ROMMA and the two PA algorithms adopt 
more aggressive strategies than the other three algorithms, resulting in more updates and better 
classification performance. For the convenience of discussion, we refer to agg-ROMMA and two 
PA algorithms as aggressive algorithms, and the other three online learning algorithms as non- 
aggressive ones. 

Second, we observe that among the three variants of double updating online learning, the DUOL
approach, which solves the optimization problem exactly, yields the lowest mistake rate with the
smallest number of support vectors in most of the cases. Compared with the baseline algorithms,





1. LIBSVM website is http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
2. UCI ML repository is at http://www.ics.uci.edu/~mlearn/MLRepository.html.
3. MIT CBCL face data sets can be found at http://cbcl.mit.edu/software-datasets.

Algorithm | sonar: Mistakes (%) | sonar: Support Vectors (#) | sonar: Time (s) | splice: Mistakes (%) | splice: Support Vectors (#) | splice: Time (s)
Perceptron | 38.125 ± 3.815 | 79.30 ± 7.93 | 0.004 | 27.120 ± 0.975 | 271.20 ± 9.75 | 0.017
ROMMA | 36.587 ± 2.976 | 76.10 ± 6.19 | 0.006 | 25.560 ± 0.814 | 255.60 ± 8.14 | 0.032
agg-ROMMA | 34.928 ± 2.860 | 130.05 ± 7.51 | 0.009 | 22.980 ± 0.780 | 602.90 ± 7.42 | 0.044
ALMA_2(0.9) | 36.370 ± 3.572 | 86.25 ± 6.43 | 0.006 | 26.040 ± 0.965 | 314.95 ± 9.41 | 0.032
PA-I | 40.986 ± 2.837 | 154.15 ± 6.95 | 0.004 | 23.815 ± 1.042 | 665.60 ± 5.60 | 0.029
PA-II | 40.481 ± 3.023 | 162.40 ± 6.26 | 0.004 | 23.515 ± 1.005 | 689.00 ± 7.85 | 0.029
DUOL_iter | 39.495 ± 3.299 | 149.85 ± 3.42 | 0.014 | 23.205 ± 0.932 | 566.85 ± 13.08 | 0.097
DUOL_appr | 41.010 ± 2.335 | 162.25 ± 5.01 | 0.013 | 21.945 ± 1.134 | 721.85 ± 9.10 | 0.095
DUOL | 34.255 ± 2.811 | 137.60 ± 6.99 | 0.017 | 20.875 ± 0.868 | 577.15 ± 10.81 | 0.087

Algorithm | german: Mistakes (%) | german: Support Vectors (#) | german: Time (s) | mushrooms: Mistakes (%) | mushrooms: Support Vectors (#) | mushrooms: Time (s)
Perceptron | 34.760 ± 0.947 | 347.60 ± 9.47 | 0.019 | 2.083 ± 0.278 | 169.25 ± 22.58 | 0.148
ROMMA | 34.725 ± 1.009 | 347.25 ± 10.09 | 0.037 | 2.429 ± 0.101 | 197.35 ± 824 | 0.264
agg-ROMMA | 32.925 ± 1.184 | 633.40 ± 14.02 | 0.049 | 1.568 ± 0.096 | 1307.90 ± 39.59 | 0.576
ALMA_2(0.9) | 33.480 ± 0.681 | 394.75 ± 9.24 | 0.036 | 2.538 ± 0.297 | 304.80 ± 38.02 | 0.267
PA-I | 33.010 ± 1.025 | 721.10 ± 12.99 | 0.031 | 1.661 ± 0.089 | 1221.55 ± 22.80 | 0.454
PA-II | 32.630 ± 1.016 | 749.50 ± 11.84 | 0.032 | 1.657 ± 0.088 | 1326.20 ± 22.85 | 0.483
DUOL_iter | 35.985 ± 1.077 | 714.35 ± 12.75 | 0.125 | 1.537 ± 0.101 | 860.05 ± 23.00 | 0.52
DUOL_appr | 30.275 ± 0.937 | 716.10 ± 10.44 | 0.096 | 1.459 ± 0.101 | 1291.35 ± 32.03 | 0.658
DUOL | 31.810 ± 1.090 | 656.30 ± 14.36 | 0.108 | 0.596 ± 0.053 | 453.70 ± 19.40 | 0.34

Algorithm | dorothea: Mistakes (%) | dorothea: Support Vectors (#) | dorothea: Time (s) | spambase: Mistakes (%) | spambase: Support Vectors (#) | spambase: Time (s)
Perceptron | 3.257 ± 0.973 | 152.45 ± 11.18 | 0.016 | 24.987 ± 0.525 | 1149.65 ± 24.16 | 0.215
ROMMA | 7.461 ± 0.537 | 200.80 ± 6.18 | 0.035 | 23.953 ± 0.510 | 1102.10 ± 23.44 | 0.275
agg-ROMMA | 17.435 ± 0.500 | 438.30 ± 13.83 | 0.044 | 21.242 ± 0.384 | 2550.70 ± 27.28 | 0.515
ALMA_2(0.9) | 4.478 ± 0.378 | 210.25 ± 5.68 | 0.035 | 23.579 ± 0.411 | 1550.15 ± 15.65 | 0.348
PA-I | 7.500 ± 0.491 | 461.30 ± 15.80 | 0.026 | 22.112 ± 0.374 | 2861.50 ± 24.36 | 0.479
PA-II | 17.500 ± 0.491 | 461.30 ± 15.80 | 0.027 | 21.907 ± 0.340 | 3029.10 ± 24.69 | 0.504
DUOL_iter | 21.109 ± 0.796 | 559.20 ± 19.44 | 0.080 | 21.907 ± 0.432 | 2511.20 ± 34.14 | 1.215
DUOL_appr | 7.500 ± 0.491 | 461.30 ± 15.80 | 0.054 | 20.185 ± 0.351 | 2981.00 ± 26.95 | 1.091
DUOL | 11.757 ± 0.237 | 407.50 ± 12.80 | 0.080 | 19.438 ± 0.282 | 2494.95 ± 26.19 | 1.069

Algorithm | MITFace: Mistakes (%) | MITFace: Support Vectors (#) | MITFace: Time (s) | w7a: Mistakes (%) | w7a: Support Vectors (#) | w7a: Time (s)
Perceptron | 4.665 ± 0.192 | 325.50 ± 13.37 | 0.207 | 4.027 ± 0.095 | 994.40 ± 23.57 | 3.392
ROMMA | 4.114 ± 0.155 | 287.05 ± 10.84 | 0.285 | 4.158 ± 0.087 | 1026.75 ± 21.51 | 1.875
agg-ROMMA | 3.137 ± 0.093 | 1121.15 ± 24.18 | 0.555 | 3.500 ± 0.061 | 2318.65 ± 60.49 | 3.257
ALMA_2(0.9) | 4.467 ± 0.169 | 400.10 ± 10.53 | 0.297 | 3.518 ± 0.071 | 1031.05 ± 15.33 | 1.314
PA-I | 3.190 ± 0.128 | 1155.45 ± 14.53 | 0.439 | 3.701 ± 0.057 | 2839.60 ± 41.57 | 2.691
PA-II | 3.108 ± 0.112 | 1222.05 ± 13.73 | 0.463 | 3.571 ± 0.053 | 3391.50 ± 51.94 | 3.311
DUOL_iter | 2.551 ± 0.128 | 963.45 ± 23.80 | 0.572 | 4.456 ± 0.073 | 3048.85 ± 54.53 | 4.566
DUOL_appr | 2.687 ± 0.140 | 1262.50 ± 20.68 | 0.656 | 3.116 ± 0.104 | 2908.95 ± 28.65 | 3.679
DUOL | 2.151 ± 0.106 | 697.95 ± 13.17 | 0.445 | 2.914 ± 0.045 | 2402.55 ± 39.88 | 6.470


























Table 2: Evaluation of online learning algorithms on the binary-class data sets.
we observe that DUOL achieves significantly smaller mistake rates than the other single-updating
algorithms in all cases. This shows that the proposed double updating approach is effective in
improving the performance of online prediction. By examining the number of support vectors, we
observe that DUOL results in sparser classifiers than the three aggressive online learning
algorithms, and denser classifiers than the three non-aggressive algorithms.


Third, according to the running time results, we observe that DUOL is efficient overall
compared with the state-of-the-art online learning algorithms. Among all the algorithms in
comparison, Perceptron, owing to its simplicity, is clearly the most efficient algorithm. Since DUOL
requires double updates, it is less efficient than the PA, ROMMA and ALMA algorithms, but is
comparable to the agg-ROMMA algorithm. Note that the running time comparisons differ slightly
from the results in our previous conference paper (Zhao et al., 2009) because we improved the
efficiency of the implementations of some existing algorithms for this journal article.


5.3 Evaluation of Different Auxiliary Example Selection Strategies and the Sensitivity to 
Parameter C for DUOL 


As the performance of DUOL depends heavily on the choice of auxiliary examples, in this section
we evaluate different auxiliary example selection strategies. Specifically, we compare the proposed
strategy to a random selection approach, referred to as "DUOL_rand", which randomly chooses an
auxiliary example from the existing support vectors. The exact solution to the problem in (1), given
by Proposition 2, is used for updating the weights of both examples. We set ρ = 0 and σ = 8 for all
the data sets, as in the previous experiments.
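
The two selection strategies being compared can be sketched as follows. This is a hedged Python illustration with hypothetical names: `w[i]` is assumed to hold the conflict score between the new example and support vector i (for the binary case, assumed to be y_t y_i κ(x_t, x_i)), `margins[i]` its current margin, and the acceptance test `w_min <= -rho` is an assumed binary analogue of the threshold test in the algorithm.

```python
import numpy as np

def select_conflicting(w, margins, rho=0.0):
    # Proposed-style rule: among support vectors with margin <= 1, take the
    # one with the smallest conflict score, and accept it only if that score
    # is at most -rho. Returns an index, or None if no auxiliary example fits.
    best, w_min = None, np.inf
    for i, (w_i, m_i) in enumerate(zip(w, margins)):
        if m_i <= 1.0 and w_i <= w_min:
            best, w_min = i, w_i
    if best is None or w_min > -rho:
        return None
    return best

def select_random(n_support, rng):
    # DUOL_rand-style rule: pick any existing support vector uniformly.
    if n_support == 0:
        return None
    return int(rng.integers(n_support))
```

The random rule ignores both the margin and the sign of the conflict score, so it may pick a support vector whose weight update does little to offset the new example.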


Figure 3 compares the online prediction performance of DUOL and DUOL_rand, as well as that of
the other competing algorithms, with varied C values across the eight data sets. Several
observations can be drawn from the results.


First, it is clear that the proposed strategy for selecting auxiliary examples is more effective
than the random selection strategy in most cases. Second, among all the compared algorithms,
we observe that DUOL always achieves the best performance when C is sufficiently large (e.g.,
C > 10), except for the data sets "german" and "w7a", where a smaller C value tends to produce a
better result. This observation is consistent with our previous theoretical result, which indicates
that setting a large C value usually implies more strong updates and consequently a better mistake
bound. Third, we observe that the proposed DUOL algorithm is significantly more accurate than the
other two variants of double updating online learning (DUOL_iter and DUOL_appr) for varied C
values, as we expected. We observe that DUOL_iter, the iterative updating approach, performs
unstably, which might be due to the local optima that its heuristic update can fall into. This
observation validates the importance of performing the optimal double updates by the proposed
DUOL algorithm.


5.4 Empirical Evaluation of Mistake Bounds 


To examine how the double updating strategy affects the mistake bound, we empirically compare M,
the total number of mistakes made by the DUOL algorithm, M_d^w(ρ), the number of mistake cases
where the weak double updates are applied, and M_d^s(ρ), the number of mistake cases where the
strong double updates are applied. Figure 4 shows the comparison between M, M_d^w(ρ), and M_d^s(ρ)
as ρ varies from 0 to 1.
Figure 3: Comparison between DUOL and DUOL_rand with varied C values.
First, we observe that double updates are frequently applied when ρ is small. This is because
it is easier to find an auxiliary example for double updating when ρ is small. Further, we find that
setting ρ close to 0 by default often leads to the best or close to the best results. Second, we observe
that the number of weak updates is significantly larger than that of strong updates. This is because
the condition for conducting a strong double update is significantly harder to satisfy
than that for a weak double update. Third, we observe that both M_d^w(ρ) and M_d^s(ρ) decrease
monotonically as ρ increases. In the extreme case, when ρ is close to 1, their values often
drop to zero, indicating that no double update was applied. Meanwhile, we find that the total
number of mistakes often reaches its maximum as ρ approaches 1. These results again validate the
importance and effectiveness of the proposed double updating algorithm.


5.5 Testbeds and Experimental Setup for Multiclass Online Learning 


Table 3 shows the multiclass data sets from web machine learning repositories used in our
experiments. We compare the proposed M-DUOL algorithm with six state-of-the-art online learning
algorithms. The first three algorithms are variants of the Perceptron-based methods studied in
Crammer and Singer (2003). They are: (i) "Max", the perceptron method based on the max-score
multiclass update, (ii) "Uniform", the perceptron method based on the uniform multiclass update,
and (iii) "Prop", the perceptron method based on the proportion multiclass update. We also compare
the proposed algorithm with three other state-of-the-art online multiclass learning algorithms:
the MIRA algorithm proposed by Crammer and Singer (2003), and the Passive-Aggressive
(PA) algorithms, "PA-I" and "PA-II", proposed by Crammer et al. (2006). Similar to the experiments
on binary classification, we implement three variants of the proposed M-DUOL algorithm based on
different solvers to the problem in (5), that is, "M-DUOL_appr", "M-DUOL", and "M-DUOL_iter".
For all experiments, we use the Gaussian kernel with σ = 8 and set C = 10. The threshold ρ in the
proposed algorithms is set to 0 for all experiments. All the experiments were repeated 20 times and
the final results are averaged over the 20 runs.














Data Set  | # training examples | # classes | # features
vehicle   |                 846 |         4 |         18
dna       |               2,000 |         3 |        180
segment   |               2,310 |         7 |         19
satimage  |               4,435 |         6 |         36
usps      |               7,291 |        10 |        256
mnist     |              10,000 |        10 |        780
letter    |              15,000 |        26 |         16
protein   |              17,766 |         3 |        357

















Table 3: Multiclass data sets used in the experiments. 


5.6 Performance Evaluation for Multi-class Online Learning 


Table 4 summarizes the empirical performance for multi-class online learning. Several observations 
can be drawn from the experimental results. 

First, by comparing all the baseline algorithms, we find that the two PA algorithms yield
considerably lower mistake rates than the other single-updating online learning algorithms. On the
other hand, the classifiers learned by the three Perceptron-based algorithms (Max, Uniform, and
Figure 4: Empirical comparison of M, M_d^w(ρ) and M_d^s(ρ) with respect to varied ρ ∈ [0, 1] values.

Algorithm | vehicle: Mistakes (%) | vehicle: Support Vectors (#) | vehicle: Time (s) | dna: Mistakes (%) | dna: Support Vectors (#) | dna: Time (s)
Max | 64.882 ± 1.643 | 548.90 ± 13.90 | 0.079 | 20.460 ± 0.770 | 409.20 ± 15.41 | 0.192
Uniform | 65.934 ± 1.554 | 557.80 ± 13.15 | 0.109 | 19.875 ± 0.427 | 397.50 ± 8.54 | 0.264
Prop | 66.678 ± 1.757 | 564.10 ± 14.86 | 0.116 | 20.268 ± 0.555 | 405.35 ± 11.10 | 0.267
MIRA | 62.252 ± 2.114 | 526.65 ± 17.89 | 1.821 | 26.920 ± 0.880 | 538.40 ± 17.61 | 5.304
PA-I | 67.086 ± 1.479 | 781.70 ± 12.42 | 0.091 | 15.503 ± 0.474 | 1224.35 ± 13.48 | 0.326
PA-II | 66.909 ± 1.475 | 789.30 ± 10.73 | 0.089 | 15.398 ± 0.467 | 1237.50 ± 13.12 | 0.325
M-DUOL_iter | 70.674 ± 1.194 | 7158.05 ± 8.65 | 0.162 | 11.668 ± 0.599 | 1086.00 ± 16.39 | 0.502
M-DUOL_appr | 69.634 ± 1.463 | 828.05 ± 4.48 | 0.158 | 14.105 ± 0.611 | 1281.75 ± 14.44 | 0.495
M-DUOL | 51.950 ± 1.948 | 719.25 ± 10.95 | 0.172 | 10.340 ± 0.513 | 869.80 ± 12.61 | 0.438

Algorithm | segment: Mistakes (%) | segment: Support Vectors (#) | segment: Time (s) | satimage: Mistakes (%) | satimage: Support Vectors (#) | satimage: Time (s)
Max | 41.342 ± 1.013 | 955.00 ± 23.40 | 0.414 | 29.628 ± 0.561 | 1314.00 ± 24.89 | 0.826
Uniform | 41.468 ± 0.550 | 957.90 ± 12.71 | 0.566 | 28.440 ± 0.398 | 1261.30 ± 17.64 | 071
Prop | 41.589 ± 0.714 | 960.70 ± 16.48 | 0.565 | 28.878 ± 0.467 | 1280.75 ± 20.72 | .087
MIRA | 35.784 ± 3.770 | 826.55 ± 87.08 | 9.193 | 27.536 ± 2.228 | 1221.20 ± 98.80 | 15.229
PA-I | 39.775 ± 0.665 | 1852.75 ± 19.90 | 0.573 | 27.377 ± 0.361 | 2676.40 ± 24.88 | .296
PA-II | 39.842 ± 0.655 | 1870.70 ± 18.97 | 0.577 | 27.258 ± 0.429 | 2709.50 ± 23.77 | 307
M-DUOL_iter | 41.416 ± 1.084 | 1787.90 ± 31.00 | 0.903 | 33.894 ± 0.567 | 2787.45 ± 43.18 | 2.024
M-DUOL_appr | 39.314 ± 0.791 | 1923.60 ± 14.31 | 0.871 | 26.222 ± 0.464 | 3052.50 ± 31.39 | 2.024
M-DUOL | 20.580 ± 0.705 | 1265.15 ± 28.39 | 0.693 | 22.524 ± 0.482 | 2066.85 ± 32.99 | 505

Algorithm | usps: Mistakes (%) | usps: Support Vectors (#) | usps: Time (s) | mnist: Mistakes (%) | mnist: Support Vectors (#) | mnist: Time (s)
Max | 10.025 ± 0.195 | 730.90 ± 14.21 | 1.459 | 15.318 ± 0.168 | 1531.80 ± 16.80 | 2.744
Uniform | 9.445 ± 0.150 | 688.60 ± 10.91 | 1.858 | 14.603 ± 0.201 | 1460.25 ± 20.15 | 3.631
Prop | 9.614 ± 0.176 | 700.95 ± 12.86 | 1.868 | 14.763 ± 0.228 | 1476.30 ± 22.78 | 3.635
MIRA | 11.572 ± 0.403 | 843.75 ± 29.39 | 44.663 | 18.037 ± 0.539 | 1803.70 ± 53.93 | 67.168
PA-I | 6.641 ± 0.158 | 2528.45 ± 23.48 | 2.669 | 11.026 ± 0.208 | 4773.70 ± 32.84 | 5.771
PA-II | 6.568 ± 0.116 | 2561.95 ± 27.94 | 2.606 | 10.959 ± 0.238 | 4830.40 ± 27.06 | 5.824
M-DUOL_iter | 5.743 ± 0.158 | 2284.15 ± 40.06 | 3.160 | 8.947 ± 0.182 | 4398.95 ± 46.46 | 9.031
M-DUOL_appr | 6.002 ± 0.132 | 2725.40 ± 23.55 | 3.541 | 9.640 ± 0.164 | 5163.05 ± 37.34 | 10.386
M-DUOL | 5.162 ± 0.149 | 1759.30 ± 23.44 | 2.408 | 8.282 ± 0.183 | 3557.15 ± 25.17 | 7.050

Algorithm | letter: Mistakes (%) | letter: Support Vectors (#) | letter: Time (s) | protein: Mistakes (%) | protein: Support Vectors (#) | protein: Time (s)
Max | 71.562 ± 0.538 | 10734.35 ± 80.63 | 18.749 | 47.657 ± 0.221 | 8466.75 ± 39.21 | 12.842
Uniform | 71.973 ± 0.280 | 10795.90 ± 41.99 | 47.031 | 46.828 ± 0.272 | 8319.45 ± 48.36 | 14.342
Prop | 72.033 ± 0.273 | 10804.95 ± 40.89 | 43.683 | 47.260 ± 0.260 | 8396.15 ± 46.13 | 14.620
MIRA | 67.709 ± 1.196 | 10156.35 ± 179.54 | 467.019 | 47.905 ± 0.922 | 8510.80 ± 163.74 | 42.174
PA-I | 72.283 ± 0.338 | 14708.55 ± 15.27 | 24.848 | 47.657 ± 0.230 | 14153.25 ± 49.06 | 23.409
PA-II | 72.339 ± 0.380 | 14735.55 ± 15.86 | 24.131 | 47.550 ± 0.285 | 14285.85 ± 44.94 | 23.602
M-DUOL_iter | 73.066 ± 0.326 | 14614.65 ± 22.26 | 210.684 | 50.070 ± 0.392 | 14191.85 ± 64.80 | 55.622
M-DUOL_appr | 69.992 ± 0.331 | 14892.70 ± 11.77 | 215.587 | 51.459 ± 0.582 | 16000.55 ± 72.07 | 63.065
M-DUOL | 54.068 ± 0.351 | 13140.40 ± 37.33 | 186.452 | 46.281 ± 0.418 | 12550.10 ± 87.27 | 43.774
































Table 4: Evaluation of multiclass online learning algorithms on the multiclass data sets.
Prop) and MIRA are considerably sparser than those learned by the two PA algorithms. We believe
that this can be attributed to the aggressive updating strategies used by the PA algorithms. Second,
among the three variants of double updating for multi-label learning, it is not surprising to observe
that M-DUOL yields the lowest mistake rates for all data sets. Further, among all the algorithms,
we observe that the M-DUOL algorithm makes the fewest mistakes for all data sets, and
significantly outperforms all the baseline algorithms.

Third, by examining the sparsity of classifiers learned by the proposed algorithms, we observe
that the number of support vectors identified by M-DUOL is usually smaller than that of the PA
algorithms (except for the data set "vehicle"), but is significantly larger than those of the four
non-aggressive algorithms (i.e., Max, Uniform, Prop, and MIRA).

Finally, comparing the running time cost, we observe that the Max algorithm is the most
efficient one, while MIRA is the least efficient approach for all the data sets. Despite the additional
time needed for double updates, overall we found that the running time of the proposed M-DUOL
algorithm is comparable to those of the two PA algorithms (except for the "letter" data set, where the
time costs of the M-DUOL algorithms are considerably greater than those of the PA algorithms).


6. Discussions and Future Directions 


Although encouraging results have been achieved by the proposed DUOL algorithms, we
should address the limitations of our current work and discuss some research directions for future
improvements. First of all, the proposed DUOL algorithm is based on the Passive-Aggressive
online learning algorithms (Crammer et al., 2006). In future work, it is possible to extend other
single-update online learning methods, such as EG (Kivinen and Warmuth, 1995), to double
updating. Second, the approach for choosing an auxiliary example from existing support vectors may
be further improved by exploring heuristics for measuring the informativeness of an example.
Finally, we plan to extend the proposed double updating framework to budget online learning in
order to produce sparse classifiers.


7. Conclusions 


This paper presented a novel "double updating" approach to online learning, named "DUOL",
which not only updates the weight of the misclassified example, but also adjusts the weight of the
one existing support vector that most seriously conflicts with the new support vector. We show
that the mistake bound for an online classification task can be significantly reduced by the proposed
DUOL algorithms. We have conducted an extensive set of experiments, comparing with a number
of algorithms for both binary and multiclass online classification. Promising empirical results
show that the proposed double updating online learning algorithms consistently outperform the
single-update online learning algorithms.

Appendix A. The Proof for Proposition 2


Proof The optimization (1) can be rewritten as the following equivalent optimization:

$$\min_{\gamma_a,\, d_{\gamma_b}} \ \frac{k_a}{2}\gamma_a^2 + \frac{k_b}{2} d_{\gamma_b}^2 + w_{ab}\gamma_a d_{\gamma_b} - \ell_a\gamma_a - \ell_b d_{\gamma_b}$$

$$\text{s.t.}\quad \gamma_a - C \le 0, \tag{6}$$
$$-\gamma_a \le 0, \tag{7}$$
$$d_{\gamma_b} - C + \hat{\gamma}_b \le 0, \tag{8}$$
$$-d_{\gamma_b} - \hat{\gamma}_b \le 0, \tag{9}$$

where $k_a, k_b > 0$, $w_{ab} \le 0$, $\ell_a = 1 - y_a f(x_a) \ge 0$, $\ell_b = 1 - y_b f(x_b) \ge 0$ and
$\hat{\gamma}_b \ge 0$. With $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ as Lagrange multipliers, the KKT
conditions for this problem consist of the constraints (6)-(9), the nonnegativity constraints
$\lambda_i \ge 0, \forall i$, the complementary slackness conditions

$$\lambda_1(\gamma_a - C) = 0, \quad \lambda_2(-\gamma_a) = 0, \quad \lambda_3(d_{\gamma_b} - C + \hat{\gamma}_b) = 0, \quad \lambda_4(-d_{\gamma_b} - \hat{\gamma}_b) = 0,$$

and the zero gradient conditions:

$$k_a\gamma_a + w_{ab} d_{\gamma_b} - \ell_a + \lambda_1 - \lambda_2 = 0 \quad\text{and}\quad k_b d_{\gamma_b} + w_{ab}\gamma_a - \ell_b + \lambda_3 - \lambda_4 = 0.$$





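We will discuss every possible condition to compute the closed-form solution. Firstly, we will
discuss the case $\lambda_1 \neq 0$:


A.1 Case 1. If $\lambda_1 \neq 0$


Since $\lambda_1(\gamma_a - C) = 0$, we have $\gamma_a = C$; further, because $\lambda_2(-\gamma_a) = 0$, we have
$\lambda_2 = 0$. Under the condition $\lambda_1 \neq 0$, we will discuss $\lambda_3 \neq 0$ and $\lambda_3 = 0$
separately as follows:


A.1.1 SUB-CASE 1.1. IF $\lambda_3 \neq 0$


Since $\lambda_3[d_{\gamma_b} - (C - \hat{\gamma}_b)] = 0$, we have $d_{\gamma_b} = C - \hat{\gamma}_b$; as a result
$\lambda_4(-C) = 0$, so $\lambda_4 = 0$. Plugging the results $\gamma_a = C$, $\lambda_2 = 0$,
$d_{\gamma_b} = C - \hat{\gamma}_b$ and $\lambda_4 = 0$ into the zero gradient conditions, we have

$$k_a C + w_{ab}(C - \hat{\gamma}_b) - \ell_a + \lambda_1 = 0 \quad\text{and}\quad k_b(C - \hat{\gamma}_b) + w_{ab} C - \ell_b + \lambda_3 = 0.$$

Thus, we have

$$\lambda_1 = -[k_a C + w_{ab}(C - \hat{\gamma}_b) - \ell_a] \quad\text{and}\quad \lambda_3 = -[k_b(C - \hat{\gamma}_b) + w_{ab} C - \ell_b].$$

As a result, if

$$(k_a C + w_{ab}(C - \hat{\gamma}_b) - \ell_a) < 0 \quad\text{and}\quad (k_b(C - \hat{\gamma}_b) + w_{ab} C - \ell_b) < 0,$$

then the KKT conditions are satisfied, and $(\gamma_a, d_{\gamma_b}) = (C, C - \hat{\gamma}_b)$ is the unique solution.

A.1.2 SUB-CASE 1.2. IF $\lambda_3 = 0$


When $\lambda_3 = 0$, we can only conclude $d_{\gamma_b} \in [-\hat{\gamma}_b, C - \hat{\gamma}_b]$.

Under the conditions $\lambda_1 \neq 0$ and $\lambda_3 = 0$, we will discuss the two cases $\lambda_4 \neq 0$ and
$\lambda_4 = 0$, respectively, as follows.

Sub-case 1.2.1. If $\lambda_4 \neq 0$. Since $\lambda_4(-d_{\gamma_b} - \hat{\gamma}_b) = 0$, we have
$d_{\gamma_b} = -\hat{\gamma}_b$. Plugging the results $\lambda_3 = 0$, $\gamma_a = C$, $\lambda_2 = 0$ and
$d_{\gamma_b} = -\hat{\gamma}_b$ into the zero gradient conditions gives:

$$k_a C + w_{ab}(-\hat{\gamma}_b) - \ell_a + \lambda_1 = 0 \quad\text{and}\quad k_b(-\hat{\gamma}_b) + w_{ab} C - \ell_b - \lambda_4 = 0.$$

But since $k_b(-\hat{\gamma}_b) \le 0$, $w_{ab} C \le 0$, $\ell_b \ge 0$ and $\lambda_4 > 0$, we have
$k_b(-\hat{\gamma}_b) + w_{ab} C - \ell_b - \lambda_4 < 0$, which contradicts the equation above.

Sub-case 1.2.2. If $\lambda_4 = 0$. Plugging the conditions $\gamma_a = C$, $\lambda_2 = 0$, $\lambda_3 = 0$ and
$\lambda_4 = 0$ into the zero gradient equations gives:

$$k_a C + w_{ab} d_{\gamma_b} - \ell_a + \lambda_1 = 0 \quad\text{and}\quad k_b d_{\gamma_b} + w_{ab} C - \ell_b = 0.$$

Solving the above equations leads to the following:

$$\lambda_1 = \frac{w_{ab}^2 C - w_{ab}\ell_b - k_a k_b C + k_b\ell_a}{k_b} \quad\text{and}\quad d_{\gamma_b} = \frac{\ell_b - w_{ab} C}{k_b}.$$

If $\frac{w_{ab}^2 C - w_{ab}\ell_b - k_a k_b C + k_b\ell_a}{k_b} > 0$ and
$\frac{\ell_b - w_{ab} C}{k_b} \in [-\hat{\gamma}_b, C - \hat{\gamma}_b]$, then the KKT conditions are all satisfied; as
a result, $(\gamma_a, d_{\gamma_b}) = \left(C, \frac{\ell_b - w_{ab} C}{k_b}\right)$ is the unique optimal solution.

Next we will discuss the situation with the condition $\lambda_1 = 0$.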


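A.2 Case 2. If $\lambda_1 = 0$


Under the condition $\lambda_1 = 0$, we can only conclude $\gamma_a \in [0, C]$. We will discuss the cases
$\lambda_2 \neq 0$ and $\lambda_2 = 0$ under the condition $\lambda_1 = 0$, respectively.


A.2.1 SUB-CASE 2.1. IF $\lambda_2 \neq 0$


Since $\lambda_2(-\gamma_a) = 0$, we conclude $\gamma_a = 0$. Under the conditions $\lambda_1 = 0$ and
$\lambda_2 \neq 0$, we will discuss the cases $\lambda_3 \neq 0$ and $\lambda_3 = 0$:

Sub-case 2.1.1. If $\lambda_3 \neq 0$. Since $\lambda_3[d_{\gamma_b} - (C - \hat{\gamma}_b)] = 0$, we have
$d_{\gamma_b} = C - \hat{\gamma}_b$ and hence $\lambda_4 = 0$. Plugging the conditions $\lambda_1 = 0$, $\gamma_a = 0$,
$d_{\gamma_b} = C - \hat{\gamma}_b$ and $\lambda_4 = 0$ into the zero gradient conditions gives:

$$w_{ab}(C - \hat{\gamma}_b) - \ell_a - \lambda_2 = 0 \quad\text{and}\quad k_b(C - \hat{\gamma}_b) - \ell_b + \lambda_3 = 0.$$

Since $w_{ab} \le 0$, $C - \hat{\gamma}_b \ge 0$ and $\ell_a \ge 0$, we conclude

$$\lambda_2 = w_{ab}(C - \hat{\gamma}_b) - \ell_a \le 0.$$

But $\lambda_2 \ge 0$ and $\lambda_2 \neq 0$ imply $\lambda_2 > 0$, which contradicts the inequality above.

Sub-case 2.1.2. If $\lambda_3 = 0$. Under these known conditions, we only know
$d_{\gamma_b} \in [-\hat{\gamma}_b, C - \hat{\gamma}_b]$. Below, we will discuss the cases $\lambda_4 \neq 0$ and
$\lambda_4 = 0$, under the conditions $\lambda_1 = 0$, $\lambda_2 \neq 0$ and $\lambda_3 = 0$.

• If $\lambda_4 \neq 0$, since $\lambda_4(-d_{\gamma_b} - \hat{\gamma}_b) = 0$, $d_{\gamma_b} = -\hat{\gamma}_b$. From the
conditions $\lambda_1 = 0$, $\gamma_a = 0$, $\lambda_3 = 0$ and $d_{\gamma_b} = -\hat{\gamma}_b$ and the zero gradient
conditions, we have

$$w_{ab}(-\hat{\gamma}_b) - \ell_a - \lambda_2 = 0 \quad\text{and}\quad k_b(-\hat{\gamma}_b) - \ell_b - \lambda_4 = 0.$$

Since $k_b > 0$, $\hat{\gamma}_b \ge 0$ and $\ell_b \ge 0$, we conclude

$$\lambda_4 = k_b(-\hat{\gamma}_b) - \ell_b \le 0.$$

But the inequality above contradicts $\lambda_4 > 0$.

• Else if $\lambda_4 = 0$, from the conditions $\lambda_1 = 0$, $\gamma_a = 0$, $\lambda_3 = 0$ and $\lambda_4 = 0$ and
the zero gradient conditions, we have

$$w_{ab} d_{\gamma_b} - \ell_a - \lambda_2 = 0 \quad\text{and}\quad k_b d_{\gamma_b} - \ell_b = 0.$$

Since $w_{ab} \le 0$, $\ell_b, \ell_a \ge 0$ and $k_b > 0$,

$$\lambda_2 = w_{ab}\frac{\ell_b}{k_b} - \ell_a \le 0,$$

which contradicts $\lambda_2 > 0$ (since $\lambda_2 \neq 0$).


A.2.2 SUB-CASE 2.2. IF $\lambda_2 = 0$


Under the conditions $\lambda_1 = \lambda_2 = 0$, we only know $\gamma_a \in [0, C]$. Below, we will discuss the
two cases $\lambda_3 \neq 0$ and $\lambda_3 = 0$, under the conditions $\lambda_1 = \lambda_2 = 0$.

Sub-case 2.2.1. If $\lambda_3 \neq 0$. Since $\lambda_3[d_{\gamma_b} - (C - \hat{\gamma}_b)] = 0$,
$d_{\gamma_b} = C - \hat{\gamma}_b$; as a result $\lambda_4(-C) = 0$, so $\lambda_4 = 0$. From the conditions
$\lambda_1 = \lambda_2 = \lambda_4 = 0$, $d_{\gamma_b} = C - \hat{\gamma}_b$ and the zero gradient conditions:

$$k_a\gamma_a + w_{ab}(C - \hat{\gamma}_b) - \ell_a = 0 \quad\text{and}\quad k_b(C - \hat{\gamma}_b) + w_{ab}\gamma_a - \ell_b + \lambda_3 = 0.$$

As a result, if

$$\frac{\ell_a - w_{ab}(C - \hat{\gamma}_b)}{k_a} \in [0, C] \quad\text{and}\quad \ell_b - k_b(C - \hat{\gamma}_b) - w_{ab}\frac{\ell_a - w_{ab}(C - \hat{\gamma}_b)}{k_a} > 0,$$

the unique optimal solution is
$(\gamma_a, d_{\gamma_b}) = \left(\frac{\ell_a - w_{ab}(C - \hat{\gamma}_b)}{k_a}, C - \hat{\gamma}_b\right)$.

Sub-case 2.2.2. If $\lambda_3 = 0$. According to $\lambda_1 = \lambda_2 = \lambda_3 = 0$, we can only conclude
$d_{\gamma_b} \in [-\hat{\gamma}_b, C - \hat{\gamma}_b]$.

• If $\lambda_4 \neq 0$, since $\lambda_4(-d_{\gamma_b} - \hat{\gamma}_b) = 0$, $d_{\gamma_b} = -\hat{\gamma}_b$. From
$\lambda_1 = \lambda_2 = \lambda_3 = 0$, $d_{\gamma_b} = -\hat{\gamma}_b$ and the zero gradient conditions:

$$k_a\gamma_a + w_{ab}(-\hat{\gamma}_b) - \ell_a = 0 \quad\text{and}\quad k_b(-\hat{\gamma}_b) + w_{ab}\gamma_a - \ell_b - \lambda_4 = 0,$$

so that $\lambda_4 = k_b(-\hat{\gamma}_b) + w_{ab}\gamma_a - \ell_b \le 0$, which contradicts the condition $\lambda_4 > 0$.

• If $\lambda_4 = 0$, from $\lambda_1 = \lambda_2 = \lambda_3 = \lambda_4 = 0$ and the zero gradient conditions:

$$k_a\gamma_a + w_{ab} d_{\gamma_b} - \ell_a = 0 \quad\text{and}\quad k_b d_{\gamma_b} + w_{ab}\gamma_a - \ell_b = 0.$$

As a result, if $\gamma_a$ and $d_{\gamma_b}$ satisfy the following:

$$\gamma_a = \frac{k_b\ell_a - w_{ab}\ell_b}{k_a k_b - w_{ab}^2} \in [0, C] \quad\text{and}\quad d_{\gamma_b} = \frac{k_a\ell_b - w_{ab}\ell_a}{k_a k_b - w_{ab}^2} \in [-\hat{\gamma}_b, C - \hat{\gamma}_b],$$

then $(\gamma_a, d_{\gamma_b}) = \left(\frac{k_b\ell_a - w_{ab}\ell_b}{k_a k_b - w_{ab}^2}, \frac{k_a\ell_b - w_{ab}\ell_a}{k_a k_b - w_{ab}^2}\right)$
is the unique optimal solution.

Summary: The final closed-form solution to the optimization is summarized as:

$$(\gamma_a, d_{\gamma_b}) = \begin{cases}
(C,\ C - \hat{\gamma}_b) & \text{if } (k_a C + w_{ab}(C - \hat{\gamma}_b) - \ell_a) < 0 \text{ and } (k_b(C - \hat{\gamma}_b) + w_{ab} C - \ell_b) < 0, \\[4pt]
\left(C,\ \frac{\ell_b - w_{ab} C}{k_b}\right) & \text{if } \frac{w_{ab}^2 C - w_{ab}\ell_b - k_a k_b C + k_b\ell_a}{k_b} > 0 \text{ and } \frac{\ell_b - w_{ab} C}{k_b} \in [-\hat{\gamma}_b, C - \hat{\gamma}_b], \\[4pt]
\left(\frac{\ell_a - w_{ab}(C - \hat{\gamma}_b)}{k_a},\ C - \hat{\gamma}_b\right) & \text{if } \frac{\ell_a - w_{ab}(C - \hat{\gamma}_b)}{k_a} \in [0, C] \text{ and } \ell_b - k_b(C - \hat{\gamma}_b) - w_{ab}\frac{\ell_a - w_{ab}(C - \hat{\gamma}_b)}{k_a} > 0, \\[4pt]
\left(\frac{k_b\ell_a - w_{ab}\ell_b}{k_a k_b - w_{ab}^2},\ \frac{k_a\ell_b - w_{ab}\ell_a}{k_a k_b - w_{ab}^2}\right) & \text{if } \left(\frac{k_b\ell_a - w_{ab}\ell_b}{k_a k_b - w_{ab}^2},\ \frac{k_a\ell_b - w_{ab}\ell_a}{k_a k_b - w_{ab}^2}\right) \in [0, C] \times [-\hat{\gamma}_b, C - \hat{\gamma}_b].
\end{cases}$$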
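
The four feasible cases in the summary can be checked one after another, as in the following Python sketch. It is only an illustration of the case analysis in this appendix: the function names and argument order are hypothetical, and the degenerate situation where no case applies (for instance when k_a k_b equals w_ab squared) is reported as an error rather than handled.

```python
def objective(g, d, k_a, k_b, w_ab, l_a, l_b):
    # Objective of the rewritten optimization in this appendix.
    return 0.5 * k_a * g * g + 0.5 * k_b * d * d + w_ab * g * d - l_a * g - l_b * d

def solve_double_update(k_a, k_b, w_ab, l_a, l_b, gamma_b_hat, C):
    # Returns (gamma_a, d_gamma_b) by testing the four feasible KKT cases.
    lo, hi = -gamma_b_hat, C - gamma_b_hat
    # Sub-case 1.1: both variables sit at their upper bounds.
    if (k_a * C + w_ab * hi - l_a) < 0 and (k_b * hi + w_ab * C - l_b) < 0:
        return C, hi
    # Sub-case 1.2.2: gamma_a at its upper bound, d_gamma_b interior.
    d = (l_b - w_ab * C) / k_b
    lam1 = (w_ab ** 2 * C - w_ab * l_b - k_a * k_b * C + k_b * l_a) / k_b
    if lam1 > 0 and lo <= d <= hi:
        return C, d
    # Sub-case 2.2.1: d_gamma_b at its upper bound, gamma_a interior.
    g = (l_a - w_ab * hi) / k_a
    if 0 <= g <= C and (l_b - k_b * hi - w_ab * g) > 0:
        return g, hi
    # Sub-case 2.2.2: both variables interior (unconstrained stationary point).
    det = k_a * k_b - w_ab ** 2
    if det > 0:
        g = (k_b * l_a - w_ab * l_b) / det
        d = (k_a * l_b - w_ab * l_a) / det
        if 0 <= g <= C and lo <= d <= hi:
            return g, d
    raise ValueError("no case of the closed-form solution applies")
```

With k_a = k_b = 1, w_ab = -0.5, l_a = 1.2, l_b = 0.4 and gamma_b_hat = 0.3, a large C leads to the interior case, while C = 1 clips both variables at their upper bounds.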


Appendix B. The Proof for Proposition 3
Proof First of all, the product $H(Y_a) \cdot H(Y_b)$ can be simplified as:

$$H(Y_a) \cdot H(Y_b) = \sum_{i=1}^{k} \sigma(i,a)\sigma(i,b) = \sigma(r_a,a)\sigma(r_a,b) + \sigma(s_a,a)\sigma(s_a,b) = \sigma(r_a,b) - \sigma(s_a,b).$$

We can check the value of $\sigma(r_a,b) - \sigma(s_a,b)$ by examining all possible cases as follows:

1 If $r_a = r_b$, which implies that $x_a$ and $x_b$ have the same relevant labels, then we should have
$H(Y_a) \cdot H(Y_b) = 1 - \sigma(s_a,b) \ge 1$ (either 1 or 2);

2 If $r_a \neq r_b$, then:

2.1 If $r_a = s_b$, then $H(Y_a) \cdot H(Y_b) = \sigma(s_b,b) - \sigma(s_a,b) = -1 - \sigma(s_a,b) \le -1$;
2.2 If $r_a \neq s_b$, then $H(Y_a) \cdot H(Y_b) = \sigma(r_a,b) - \sigma(s_a,b) = -\sigma(s_a,b)$:

2.2.1 If $s_a = s_b$, then $H(Y_a) \cdot H(Y_b) = -\sigma(s_b,b) = 1$;

2.2.2 If $s_a = r_b$, then $H(Y_a) \cdot H(Y_b) = -\sigma(r_b,b) = -1$;

2.2.3 If $s_a \neq s_b$ and $s_a \neq r_b$, then $H(Y_a) \cdot H(Y_b) = 0$.

We thus have the fact that $H(Y_a) \cdot H(Y_b) < 0$ holds if and only if ($r_a = s_b$) or ($s_a = r_b$). ∎
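
The case analysis can also be confirmed by brute force for a small number of classes. The following Python sketch is only a sanity check with hypothetical function names; it assumes each encoding has +1 at its class r, -1 at its class s (with r and s distinct), and 0 elsewhere.

```python
from itertools import permutations

def encode(r, s, k):
    # Encoding vector with +1 at class r, -1 at class s, 0 elsewhere.
    h = [0] * k
    h[r], h[s] = 1, -1
    return h

def check_proposition_3(k):
    # Enumerate all ordered pairs (r_a, s_a) and (r_b, s_b) of distinct classes
    # and test: dot product < 0  <=>  (r_a == s_b) or (s_a == r_b).
    for r_a, s_a in permutations(range(k), 2):
        for r_b, s_b in permutations(range(k), 2):
            dot = sum(x * y for x, y in zip(encode(r_a, s_a, k), encode(r_b, s_b, k)))
            if (dot < 0) != (r_a == s_b or s_a == r_b):
                return False
    return True
```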


Appendix C. The Proof of Proposition 4 


In this appendix, we will derive the dual ascent achieved by the multiclass double updating approach.
Our approach to the proofs is mainly inspired by the study in Shalev-Shwartz (2007), but our problem
is different from that study.

For the convenience of our presentation, we introduce the following notation for our derivation.
We denote the loss function for a training example $(x, Y)$ as follows:

$$g(f) = \ell(f; x, Y) = \max_{r \in Y,\, s \notin Y} \left[1 - (f_r(x) - f_s(x))\right]_+$$

We order all the classes $r$ in the assigned set $Y$ as $r_1, \ldots, r_{|Y|}$, and the classes $s$ in the
unassigned set $\mathcal{Y} \setminus Y$ as $s_1, \ldots, s_{|\mathcal{Y} \setminus Y|}$. We slightly abuse our notation by
simplifying $\langle f, g \rangle_{\mathcal{H}_\kappa}$ as $\langle f, g \rangle$ and $\|\cdot\|_{\mathcal{H}_\kappa}$ as $\|\cdot\|$ when there is no
ambiguity about the space for computing the dot product and norm.

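We first give a lemma that shows the Fenchel conjugate of the above loss function $g$.

Lemma 2 Let $\mathcal{Y} = [k]$ be the set of possible labels, and let $Y \subset \mathcal{Y}$ be the set of relevant labels for
$x \in \mathbb{R}^n$. Let $f = (f_1, \ldots, f_k)^\top$, where $\forall i \in [k]$, $f_i \in \mathcal{H}_\kappa$, and let the loss function be
defined as follows:

$$g(f) = \max_{r \in Y,\, s \notin Y} \left[1 - (f_r(x) - f_s(x))\right]_+$$

Then for any $\lambda = (\lambda_1, \ldots, \lambda_k)^\top$, where $\forall i$, $\lambda_i \in \mathcal{H}_\kappa$, the Fenchel conjugate of $g$ is:

$$g^*(\lambda) = \begin{cases}
-\sum_i \sum_j \alpha_{ij} & \text{if } \forall i, j:\ \lambda_{r_i} + \sum_j \alpha_{ij}\kappa(x, \cdot) = 0 \text{ and } \lambda_{s_j} - \sum_i \alpha_{ij}\kappa(x, \cdot) = 0, \\
\infty & \text{otherwise,}
\end{cases}$$

where $\alpha = (\alpha_{ij}) \in \mathcal{A} = \{A \mid A \in \mathbb{R}_+^{|Y| \times |\mathcal{Y} \setminus Y|}, \|A\|_1 \le 1\}$ and
$(r_i \times s_j) \in \mathcal{B} = Y \times ([k]/Y)$.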


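Proof The approach of our proof is similar to the method for proving the "Max-of-hinge" in
Shalev-Shwartz (2007). First of all, it is not difficult to show that the loss function can be
re-formulated as follows:

$$g(f) = \max_{\alpha \in \mathcal{A},\, (r_i \times s_j) \in \mathcal{B}} \sum_{i,j} \alpha_{ij}\left[1 - (f_{r_i}(x) - f_{s_j}(x))\right]$$

$$= \max_{\alpha \in \mathcal{A},\, (r_i \times s_j) \in \mathcal{B}} \sum_{i,j} \alpha_{ij}\left[1 - \left(\langle f_{r_i}(\cdot), \kappa(x, \cdot)\rangle - \langle f_{s_j}(\cdot), \kappa(x, \cdot)\rangle\right)\right].$$

As a result, we have:

$$g^*(\lambda) = \max_f \left[\langle \lambda, f \rangle - g(f)\right]$$

$$= \max_f \left[\sum_{n=1}^{k} \langle \lambda_n, f_n \rangle - \max_{\alpha \in \mathcal{A},\, (r_i \times s_j) \in \mathcal{B}} \sum_{i,j} \alpha_{ij}\left[1 - \left(\langle f_{r_i}(\cdot), \kappa(x, \cdot)\rangle - \langle f_{s_j}(\cdot), \kappa(x, \cdot)\rangle\right)\right]\right].$$

For any $f_n, \lambda_n \in \mathcal{H}_\kappa$, they can be written as $f_n = \beta_n\kappa(x, \cdot) + f_n^\perp$ and
$\lambda_n = \gamma_n\kappa(x, \cdot) + \lambda_n^\perp$, where $f_n^\perp, \lambda_n^\perp \in V^\perp$ and $V = \mathrm{span}(\kappa(x, \cdot))$. As a
result, we have

$$g^*(\lambda) = \max_{\beta_n, f_n^\perp} \left[\sum_{n=1}^{k}\left(\langle \lambda_n^\perp, f_n^\perp \rangle + \beta_n\gamma_n\kappa(x, x)\right) - \max_{\alpha \in \mathcal{A},\, (r_i \times s_j) \in \mathcal{B}} \sum_{i,j} \alpha_{ij}\left[1 - \left(\beta_{r_i}\kappa(x, x) - \beta_{s_j}\kappa(x, x)\right)\right]\right].$$

When $\lambda_n^\perp \neq 0$, $\max_{f_n^\perp} \langle \lambda_n^\perp, f_n^\perp \rangle$ will be $\infty$, resulting in $g^*(\lambda) = \infty$.
Otherwise, if $\lambda_n^\perp = 0$, $\forall n$, the term $f_n^\perp$ does not affect the objective; as a result, the
optimal $f_n$ can be written in the form of $\beta_n\kappa(x, \cdot)$ and the conjugate is computed as follows:

$$g^*(\lambda) = \max_{\beta_n} \left\{\sum_{n=1}^{k} \beta_n\gamma_n\kappa(x, x) - \max_{\alpha \in \mathcal{A},\, (r_i \times s_j) \in \mathcal{B}} \sum_{i,j} \alpha_{ij}\left[1 - \left(\beta_{r_i}\kappa(x, x) - \beta_{s_j}\kappa(x, x)\right)\right]\right\}$$

$$= \max_{\beta_n} \min_{\alpha \in \mathcal{A},\, (r_i \times s_j) \in \mathcal{B}} \left\{\sum_{n=1}^{k} \langle \beta_n\kappa(x, \cdot), \lambda_n \rangle - \sum_{i,j} \alpha_{ij}\left[1 - \left(\langle \beta_{r_i}\kappa(x, \cdot), \kappa(x, \cdot)\rangle - \langle \beta_{s_j}\kappa(x, \cdot), \kappa(x, \cdot)\rangle\right)\right]\right\}$$

$$= \min_{\alpha \in \mathcal{A},\, (r_i \times s_j) \in \mathcal{B}} \max_{\beta_n} \left\{\sum_{n=1}^{k} \langle \beta_n\kappa(x, \cdot), \lambda_n \rangle - \sum_{i,j} \alpha_{ij}\left[1 - \left(\langle \beta_{r_i}\kappa(x, \cdot), \kappa(x, \cdot)\rangle - \langle \beta_{s_j}\kappa(x, \cdot), \kappa(x, \cdot)\rangle\right)\right]\right\}$$

$$= \min_{\alpha \in \mathcal{A},\, (r_i \times s_j) \in \mathcal{B}} \left\{-\sum_{i,j} \alpha_{ij} + \max_{\beta_n}\left[\sum_i \Big\langle \beta_{r_i}\kappa(x, \cdot), \lambda_{r_i} + \sum_j \alpha_{ij}\kappa(x, \cdot)\Big\rangle + \sum_j \Big\langle \beta_{s_j}\kappa(x, \cdot), \lambda_{s_j} - \sum_i \alpha_{ij}\kappa(x, \cdot)\Big\rangle\right]\right\}.$$

The fourth equality is guaranteed by the strong max-min property (Boyd and Vandenberghe, 2004),
and more importantly, we can see that only when $\lambda$ satisfies $\lambda_{r_i} + \sum_j \alpha_{ij}\kappa(x, \cdot) = 0$ and
$\lambda_{s_j} - \sum_i \alpha_{ij}\kappa(x, \cdot) = 0$ will the second term in the equation above be zero; otherwise, it will
be $\infty$. Therefore, we have the resulting Fenchel conjugate of $g(f)$ as follows:

$$g^*(\lambda) = \begin{cases}
-\sum_i \sum_j \alpha_{ij} & \text{if } \lambda_{r_i} + \sum_j \alpha_{ij}\kappa(x, \cdot) = 0 \text{ and } \lambda_{s_j} - \sum_i \alpha_{ij}\kappa(x, \cdot) = 0, \\
\infty & \text{otherwise.}
\end{cases}$$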


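Given the above Fenchel dual of the loss function, we can derive the dual for the optimization problem
given on the right-hand side of Equation (4), as given in the following lemma.


Lemma 3 Suppose the complexity measure function is given as $F(f) = \frac{1}{2}\sum_{i=1}^{k} \|f_i\|^2_{\mathcal{H}_\kappa}$ and we
set $\alpha^t_{ij}$ to zero for $\forall (i, j) \in [Y_t \times ([k]/Y_t)]/(r_t, s_t)$, where $(r_t, s_t)$ is defined in
Equation (2). Then the dual objective function for the optimization given on the right-hand side of
Equation (4) can be expressed as follows:

$$D(\gamma_1, \ldots, \gamma_T) = -\frac{1}{2}\sum_{i=1}^{k} \Big\|\sum_{t=1}^{T} \gamma_t\sigma(i, t)\kappa(x_t, \cdot)\Big\|^2_{\mathcal{H}_\kappa} + \sum_{t=1}^{T} \gamma_t,$$

where $\gamma_t \in [0, C]$ and

$$\sigma(i, t) = \begin{cases} 1 & \text{if } i = r_t, \\ -1 & \text{if } i = s_t, \\ 0 & \text{otherwise.} \end{cases}$$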
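
Since the squared norm of a kernel expansion with coefficient vector c equals c^T K c for the Gram matrix K, the dual objective of Lemma 3 can be evaluated numerically. The following Python sketch shows one way to do so; the function name and the array layout (a k-by-T matrix holding the values σ(i, t)) are assumptions made for illustration.

```python
import numpy as np

def dual_objective(gamma, sigma, K):
    # gamma: shape (T,), weights in [0, C]
    # sigma: shape (k, T), entries in {+1, -1, 0} as in Lemma 3
    # K:     shape (T, T), Gram matrix with K[t, s] = kappa(x_t, x_s)
    coef = sigma * gamma  # row i holds gamma_t * sigma(i, t)
    # Sum over classes i of coef[i]^T K coef[i], i.e. the squared RKHS norms.
    quad = np.einsum("it,ts,is->", coef, K, coef)
    return -0.5 * float(quad) + float(gamma.sum())
```

For a single example with γ = 0.5 and κ(x, x) = 1, the two active classes each contribute 0.25 to the quadratic term, so the dual value is -0.25 + 0.5 = 0.25.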


Proof The proof here resembles the one in the section 3.2 of Shalev-Shwartz (2007). Firstly, we 
note that the problem (4) is equivalent to the following: 


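$$\min_{f_0, f_1, \dots, f_T} \; F(f_0) + \sum_{t=1}^{T} C g_t(f_t) \quad \text{s.t.} \quad f_0, \dots, f_T \in \mathcal{H}_\kappa^k \text{ and } \forall t \in [T],\ f_t = f_0$$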


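By introducing $T$ function vectors $\lambda_1, \dots, \lambda_T$, in which each $\lambda_t = (\lambda_{t,1}, \dots, \lambda_{t,k}) \in \mathcal{H}_\kappa^k$ is a vector of Lagrange multipliers for the constraint $f_t = f_0$, we can obtain the following Lagrangian: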


$$L(f_0, \dots, f_T, \lambda_1, \dots, \lambda_T) = F(f_0) + \sum_{t=1}^{T} C g_t(f_t) + \sum_{t=1}^{T} \langle \lambda_t, f_0 - f_t \rangle.$$


The dual objective function can be derived as follows: 


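$$\begin{aligned}
D(\lambda_1, \dots, \lambda_T) &= \inf_{f_0, \dots, f_T} L(f_0, \dots, f_T, \lambda_1, \dots, \lambda_T) \\
&= - \sup_{f_0} \Big( \Big\langle - \sum_{t=1}^{T} \lambda_t, f_0 \Big\rangle - F(f_0) \Big) - \sum_{t=1}^{T} \sup_{f_t} \big( \langle \lambda_t, f_t \rangle - C g_t(f_t) \big) \\
&= - F^*\Big( - \sum_{t=1}^{T} \lambda_t \Big) - \sum_{t=1}^{T} (C g_t)^*(\lambda_t) = - F^*\Big( - \sum_{t=1}^{T} \lambda_t \Big) - \sum_{t=1}^{T} C g_t^*\Big( \frac{\lambda_t}{C} \Big)
\end{aligned}$$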


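Because $F(f) = \sum_{i=1}^{k} \frac{1}{2} \|f_i\|_{\mathcal{H}_\kappa}^2$, we have $F^* = F$. The dual problem thus becomes:

$$D(\lambda_1, \dots, \lambda_T) = - \frac{1}{2} \sum_{i=1}^{k} \Big\| \sum_{t=1}^{T} \lambda_{t,i} \Big\|_{\mathcal{H}_\kappa}^2 - \sum_{t=1}^{T} C g_t^*\Big( \frac{\lambda_t}{C} \Big)$$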
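
The identity $F^* = F$ used in this step can be checked directly from the definition of the Fenchel conjugate. The following is a short sketch, where $\theta = (\theta_1, \dots, \theta_k)$ is a generic dual argument introduced here only for illustration:

$$\begin{aligned}
F^*(\theta) &= \sup_{f} \sum_{i=1}^{k} \Big( \langle \theta_i, f_i \rangle - \frac{1}{2} \|f_i\|_{\mathcal{H}_\kappa}^2 \Big) \\
&= \sum_{i=1}^{k} \Big( \langle \theta_i, \theta_i \rangle - \frac{1}{2} \|\theta_i\|_{\mathcal{H}_\kappa}^2 \Big) \qquad \text{(each supremum is attained at } f_i = \theta_i \text{)} \\
&= \sum_{i=1}^{k} \frac{1}{2} \|\theta_i\|_{\mathcal{H}_\kappa}^2 = F(\theta).
\end{aligned}$$

Substituting $\theta = - \sum_{t=1}^{T} \lambda_t$ gives the first term of the dual, since the squared norm is unchanged by the sign.
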
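Because we want to maximize the dual objective, according to Lemma 2, we should set

$$\frac{\lambda_{t,r_i}}{C} + \sum_{j} \alpha_{ij} \kappa(x_t,\cdot) = 0, \qquad \frac{\lambda_{t,s_j}}{C} - \sum_{i} \alpha_{ij} \kappa(x_t,\cdot) = 0,$$

where $\alpha \in A$, $(r_i,s_j) \in B$, $A = \{ \alpha \mid \alpha \in \mathbb{R}_+^{|Y_t|} \times \mathbb{R}_+^{|[k]/Y_t|},\ \|\alpha\|_1 \le 1 \}$ and $B = Y_t \times ([k]/Y_t)$.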


Furthermore, we set $\alpha_{ij}$ to zeros for $\forall (i,j) \in [Y_t \times ([k]/Y_t)]/(r_t,s_t)$. For simplicity, we denote $C\alpha_{r_t,s_t}$ as $\gamma_t$. As a result, the dual objective function becomes


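$$D(\gamma_1, \dots, \gamma_T) = - \frac{1}{2} \sum_{i=1}^{k} \Big\| \sum_{t=1}^{T} \delta(i,t) \gamma_t \kappa(x_t,\cdot) \Big\|_{\mathcal{H}_\kappa}^2 + \sum_{t=1}^{T} \gamma_t$$

where $\gamma_t \in [0,C]$ and

$$\delta(i,t) = \begin{cases} 1 & \text{if } i = r_t \\ -1 & \text{if } i = s_t \\ 0 & \text{otherwise} \end{cases}$$

∎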
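
To make the dual objective of Lemma 3 concrete, the following is a minimal sketch of how it could be evaluated for a finite set of examples. It assumes a precomputed symmetric kernel matrix `K` with `K[t][u]` standing for $\kappa(x_t, x_u)$, and uses the fact that the squared RKHS norm of $\sum_t w_t \kappa(x_t,\cdot)$ equals $\sum_{t,u} w_t w_u \kappa(x_t,x_u)$. The function and argument names are hypothetical and are not taken from the authors' released code.

```python
def delta(i, r_t, s_t):
    """delta(i, t): +1 if i is the class r_t, -1 if i is the class s_t, else 0."""
    if i == r_t:
        return 1.0
    if i == s_t:
        return -1.0
    return 0.0


def dual_objective(K, gamma, r, s, k):
    """Evaluate D(gamma_1..gamma_T) = -1/2 * sum_i ||sum_t delta(i,t) gamma_t kappa(x_t,.)||^2 + sum_t gamma_t.

    K     : T x T symmetric kernel matrix (list of lists), assumed precomputed
    gamma : list of T dual weights
    r, s  : lists of T class indices (r_t, s_t) as in the lemma
    k     : number of classes
    """
    T = len(gamma)
    quad = 0.0
    for i in range(k):
        # Coefficients of the i-th component function in the kernel expansion.
        w = [delta(i, r[t], s[t]) * gamma[t] for t in range(T)]
        # Squared RKHS norm via the kernel matrix: w^T K w.
        quad += sum(w[t] * K[t][u] * w[u] for t in range(T) for u in range(T))
    return -0.5 * quad + sum(gamma)
```

For instance, with `K = [[1.0, 0.0], [0.0, 1.0]]`, `gamma = [0.5, 0.25]`, `r = [0, 1]`, `s = [1, 2]` and `k = 3`, the call `dual_objective(K, gamma, r, s, k)` evaluates the quadratic term class by class and then adds the linear term $\sum_t \gamma_t$.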


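By applying Lemma 3, we thus have the dual objective function for the $t$-th step as:

$$D_t(\gamma_1, \dots, \gamma_t) = - \frac{1}{2} \sum_{i=1}^{k} \Big\| \sum_{j=1}^{t} \delta(i,j) \gamma_j \kappa(x_j,\cdot) \Big\|_{\mathcal{H}_\kappa}^2 + \sum_{j=1}^{t} \gamma_j \qquad (10)$$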


Now our goal is to derive the dual ascent guaranteed by the proposed double updating scheme. When the pair $(x_a, Y_a)$ is misclassified by the prediction function $f_{t-1} = (f_{t-1,1}, \dots, f_{t-1,k})$, we will perform the update on the prediction function. Assuming that we conduct a double update for $(x_a, Y_a)$ and some auxiliary example $(x_b, Y_b)$, we can prove Proposition 4 as follows.


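Proof According to Equation (10) obtained by Lemma 3, before performing the double updating, the value of the dual function is expressed as:

$$D_{t-1} = - \frac{1}{2} \sum_{i=1}^{k} \Big\| \sum_{j=1}^{t-1} \delta(i,j) \hat{\gamma}_j \kappa(x_j,\cdot) \Big\|_{\mathcal{H}_\kappa}^2 + \sum_{j=1}^{t-1} \hat{\gamma}_j = - \frac{1}{2} \sum_{i=1}^{k} \|f_{t-1,i}\|_{\mathcal{H}_\kappa}^2 + \sum_{j=1}^{t-1} \hat{\gamma}_j$$

where the $\hat{\gamma}_j$'s denote the weights of the prediction function $f_{t-1}$ before the updating. After performing the dual update, the value of the new dual function can be written as:

$$D_t = - \frac{1}{2} \sum_{i=1}^{k} \big\| f_{t-1,i} + \delta(i,a) \gamma_a \kappa(x_a,\cdot) + \delta(i,b) d_{\gamma_b} \kappa(x_b,\cdot) \big\|_{\mathcal{H}_\kappa}^2 + \sum_{j=1}^{t-1} \hat{\gamma}_j + \gamma_a + d_{\gamma_b}$$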


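Hence, the dual ascent is computed as follows:

$$\begin{aligned}
\Delta D = D_t - D_{t-1} = {} & \gamma_a \big( 1 - ( f_{t-1,r_a}(x_a) - f_{t-1,s_a}(x_a) ) \big) + d_{\gamma_b} \big( 1 - ( f_{t-1,r_b}(x_b) - f_{t-1,s_b}(x_b) ) \big) \\
& - \gamma_a^2 s_a - d_{\gamma_b}^2 s_b - \sum_{i=1}^{k} \delta(i,a) \delta(i,b) \gamma_a d_{\gamma_b} \kappa(x_a, x_b).
\end{aligned}$$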
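
The closed-form dual ascent can be sanity-checked numerically against a direct evaluation of the dual before and after the update. The sketch below is only an illustration under stated assumptions: examples are identified by their index into a precomputed symmetric kernel matrix `K`, the quantities $s_a$ and $s_b$ in the last line of the formula are read as $\kappa(x_a,x_a)$ and $\kappa(x_b,x_b)$, and all function and variable names are hypothetical.

```python
def delta(i, r_t, s_t):
    """+1 if i is the class r_t, -1 if i is the class s_t, else 0."""
    return 1.0 if i == r_t else (-1.0 if i == s_t else 0.0)


def dual_value(K, gamma, r, s, k):
    """Direct evaluation of the dual: -1/2 * sum_i w_i^T K w_i + sum_j gamma_j."""
    n = len(gamma)
    quad = 0.0
    for i in range(k):
        w = [delta(i, r[j], s[j]) * gamma[j] for j in range(n)]
        quad += sum(w[p] * K[p][q] * w[q] for p in range(n) for q in range(n))
    return -0.5 * quad + sum(gamma)


def dual_ascent(K, gamma_hat, r, s, k, a, b, gamma_a, d_b):
    """Closed-form ascent when example a receives weight gamma_a and
    the weight of auxiliary example b changes by d_b."""
    n = len(gamma_hat)

    def f(i, x):
        # i-th component of the current prediction function evaluated at example x.
        return sum(delta(i, r[j], s[j]) * gamma_hat[j] * K[j][x] for j in range(n))

    margin_a = f(r[a], a) - f(s[a], a)
    margin_b = f(r[b], b) - f(s[b], b)
    # sum_i delta(i, a) * delta(i, b): overlap between the class pairs of a and b.
    cross = sum(delta(i, r[a], s[a]) * delta(i, r[b], s[b]) for i in range(k))
    return (gamma_a * (1.0 - margin_a) + d_b * (1.0 - margin_b)
            - gamma_a ** 2 * K[a][a] - d_b ** 2 * K[b][b]
            - cross * gamma_a * d_b * K[a][b])


# Toy check: three examples, three classes, linear-kernel Gram matrix.
K = [[1.0, 0.5, 0.0], [0.5, 0.5, 0.5], [0.0, 0.5, 1.0]]
r, s, k = [0, 1, 0], [1, 2, 2], 3
gamma_hat = [0.3, 0.2, 0.0]           # example 2 is new, so its weight is 0
a, b, gamma_a, d_b = 2, 0, 0.4, 0.1   # illustrative update values
gamma_new = list(gamma_hat)
gamma_new[a] += gamma_a
gamma_new[b] += d_b
direct = dual_value(K, gamma_new, r, s, k) - dual_value(K, gamma_hat, r, s, k)
closed = dual_ascent(K, gamma_hat, r, s, k, a, b, gamma_a, d_b)
```

Under these assumptions, `direct` and `closed` agree up to floating-point error, because the formula is an exact expansion of the squared norm in $D_t$.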